DISTRIBUTION  STATEMENT  A 

Approved  for  Public  Release 
Distribution  Unlimited 


Approximate  Solutions  to 
Markov  Decision  Processes 


Geoffrey  J.  Gordon 
June  1999 
CMU-CS-99-143 


School  of  Computer  Science 
Carnegie  Mellon  University 
Pittsburgh,  PA  15213 


Submitted  in  partial  fulfillment  of  the  requirements 
for the degree of Doctor of Philosophy.


Thesis  Committee: 

Tom  Mitchell,  Chair 
Andrew  Moore 
John  Lafferty 

Satinder  Singh  Baveja,  AT&T  Labs  Research 



©  Copyright  Geoffrey  J.  Gordon,  1999 


This  research  is  sponsored  in  part  by  the  Defense  Advanced  Research  Projects 
Agency (DARPA) under Contract Nos. F30602-97-1-0215 and F33615-93-1-1330, by
the  National  Science  Foundation  (NSF)  under  Grant  No.  BES-9402439,  and  by  an 
NSF  Graduate  Research  Fellowship.  The  views  and  conclusions  expressed  in  this 
publication  are  those  of  the  author  and  should  not  be  interpreted  as  representing  the 
official  policies,  either  expressed  or  implied,  of  DARPA,  NSF,  or  the  U.S.  government. 




Keywords: machine learning, reinforcement learning, dynamic programming,
Markov decision processes (MDPs), linear programming, convex programming,
function approximation, worst-case learning, regret bounds, statistics,
fitted value iteration, convergence of numerical methods


School  of  Computer  Science 


DOCTORAL  THESIS 
in  the  field  of 

COMPUTER  SCIENCE 


Approximate  Solutions  to  Markov  Decision 

Processes 

GEOFFREY  J.  GORDON 


Submitted  in  Partial  Fulfillment  of  the  Requirements 
for  the  Degree  of  Doctor  of  Philosophy 


Abstract 

One  of  the  basic  problems  of  machine  learning  is  deciding  how  to  act  in  an 
uncertain  world.  For  example,  if  I  want  my  robot  to  bring  me  a  cup  of  coffee,  it 
must  be  able  to  compute  the  correct  sequence  of  electrical  impulses  to  send  to 
its  motors  to  navigate  from  the  coffee  pot  to  my  office.  In  fact,  since  the  results 
of  its  actions  are  not  completely  predictable,  it  is  not  enough  just  to  compute 
the  correct  sequence;  instead  the  robot  must  sense  and  correct  for  deviations 
from  its  intended  path. 

In order for any machine learner to act reasonably in an uncertain environment,
it must solve problems like the above one quickly and reliably. Unfortunately,
the world is often so complicated that it is difficult or impossible to find
the  optimal  sequence  of  actions  to  achieve  a  given  goal.  So,  in  order  to  scale 
our  learners  up  to  real-world  problems,  we  usually  must  settle  for  approximate 
solutions. 

One  representation  for  a  learner’s  environment  and  goals  is  a  Markov  decision 
process  or  MDP.  MDPs  allow  us  to  represent  actions  that  have  probabilistic 
outcomes,  and  to  plan  for  complicated,  temporally-extended  goals.  An  MDP 
consists  of  a  set  of  states  that  the  environment  can  be  in,  together  with  rules 
for  how  the  environment  can  change  state  and  for  what  the  learner  is  supposed 
to  do. 

One  way  to  approach  a  large  MDP  is  to  try  to  compute  an  approximation 
to  its  optimal  state  evaluation  function,  the  function  which  tells  us  how  much 
reward  the  learner  can  be  expected  to  achieve  if  the  world  is  in  a  particular 
state.  If  the  approximation  is  good  enough,  we  can  use  a  shallow  search  to  find 
a  good  action  from  most  states.  Researchers  have  tried  many  different  ways 
to  approximate  evaluation  functions.  This  thesis  aims  for  a  middle  ground, 
between algorithms that don’t scale well because they use an impoverished representation
for the evaluation function and algorithms that we can’t analyze
because  they  use  too  complicated  a  representation. 

One of the basic problems of machine learning is deciding how to act in an
uncertain  world.  For  example,  if  I  want  my  robot  to  bring  me  a  cup  of  coffee,  it 
must  be  able  to  compute  the  correct  sequence  of  electrical  impulses  to  send  to 
its  motors  to  navigate  from  the  coffee  pot  to  my  office.  In  fact,  since  the  results 
of  its  actions  are  not  completely  predictable,  it  is  not  enough  just  to  compute 
the  correct  sequence;  instead  the  robot  must  sense  and  correct  for  deviations 
from  its  intended  path. 

In order for any machine learner to act reasonably in an uncertain environment,
it must solve problems like the above one quickly and reliably. Unfortunately,
the world is often so complicated that it is difficult or impossible to find
the  optimal  sequence  of  actions  to  achieve  a  given  goal.  So,  in  order  to  scale 
our  learners  up  to  real-world  problems,  we  usually  must  settle  for  approximate 
solutions. 

One  representation  for  a  learner’s  environment  and  goals  is  a  Markov  decision 
process  or  MDP.  MDPs  allow  us  to  represent  actions  that  have  probabilistic 
outcomes,  and  to  plan  for  complicated,  temporally-extended  goals.  An  MDP 
consists  of  a  set  of  states  that  the  environment  can  be  in,  together  with  rules 
for  how  the  environment  can  change  state  and  for  what  the  learner  is  supposed 
to  do. 

Given an MDP, our learner can in principle search through all possible sequences
of actions up to some maximum length to find the best one. In practice
the  search  will  go  faster  if  we  know  a  good  heuristic  evaluation  function,  that  is, 
a  function  which  tells  us  approximately  how  good  or  bad  it  is  to  be  in  a  given 
state.  For  small  MDPs  we  can  compute  the  best  possible  heuristic  evaluation 
function.  With  this  optimal  evaluation  function,  also  called  the  value  function, 
a  search  to  depth  one  is  sufficient  to  compute  the  optimal  action  from  any  state. 

One  way  to  approach  a  large  MDP  is  to  try  to  compute  an  approximation 
to  its  value  function.  If  the  approximation  is  good  enough,  a  shallow  search  will 
be  able  to  find  a  good  action  from  most  states.  Researchers  have  tried  many 
different  ways  to  compute  value  functions,  ranging  from  simple  approaches  based 
on dividing the states into bins and assigning the same value to all states in
each  bin,  to  complicated  approaches  involving  neural  networks  and  stochastic 
approximation.  Unfortunately,  in  general  the  simple  approaches  don’t  scale  well, 
while  the  complicated  approaches  are  difficult  to  analyze  and  are  not  guaranteed 
to  reach  a  reasonable  solution. 

This  thesis  aims  for  a  middle  ground,  between  algorithms  that  don’t  scale 
well  because  they  use  an  impoverished  representation  for  the  value  function  and 
algorithms that we can’t analyze because they use too complicated a representation.
All of the research in this thesis was motivated by the attempt to find
algorithms  that  can  use  a  reasonably  rich  representation  for  value  functions  but 
are  still  guaranteed  to  converge.  In  particular,  we  looked  for  algorithms  that 
can  represent  the  value  function  as  a  linear  combination  of  arbitrary  but  fixed 
basis  functions.  While  the  algorithms  we  describe  do  not  quite  achieve  this  goal, 
they  do  represent  a  significant  advance  over  the  previous  state  of  the  art. 

There  are  three  main  parts  to  this  thesis.  In  Chapter  2  we  will  describe 
an approach that lets us approximate an MDP’s value function using linear interpolation,
nearest-neighbor, or other similar methods. In Chapter 3 we will
step  back  and  consider  a  more  general  problem,  the  problem  of  learning  from  a 
sequence  of  training  examples  when  we  can’t  make  distributional  assumptions. 
This chapter will also serve as an introduction to the theory of convex optimization.
Finally, in Chapter 4, we will apply the theory of linear programming and
convex  optimization  to  the  problem  of  approximating  an  MDP’s  value  function. 
Chapters  2  and  4  each  contain  experimental  results  from  different  algorithms 
for  approximating  value  functions.  In  addition  to  the  three  groups  of  results 
listed  above,  this  thesis  also  contains  references  to  related  work  (in  Chapter  5) 
and a concluding summary (in Chapter 6).

These  three  threads  of  research  work  together  towards  the  goal  of  finding 
approximate  value  functions  for  Markov  decision  processes.  The  contribution  of 
Chapter  2  is  the  most  direct:  it  enlarges  the  class  of  representations  we  can  use 
for approximate value functions to include methods such as $k$-nearest-neighbor,
multilinear  interpolation,  and  kernel  regression,  for  which  there  were  previously 
no  known  convergent  algorithms.  While  Chapter  3  does  not  mention  MDPs 
directly,  it  treats  the  problem  of  learning  without  a  fixed  sampling  distribution 
or  independent  samples,  which  is  one  of  the  underlying  difficulties  in  learning 
about  MDPs.  Finally,  Chapter  4  presents  a  framework  for  designing  value 
function  approximating  algorithms  that  allow  even  more  general  representations 
than  those  of  Chapter  2. 

In more detail, Chapter 2 describes a class of function approximation architectures
(which contains, e.g., $k$-nearest-neighbor and multilinear interpolation)
for which an algorithm called fitted value iteration is guaranteed to converge.
The contributions of Chapter 2 include discovering this class and deriving convergence
rates and error bounds for the resulting algorithms. The contributions
also  include  an  improved  theoretical  understanding  of  fitted  value  iteration  via  a 
reduction  to  exact  value  iteration,  and  experimental  results  showing  that  fitted 
value  iteration  is  capable  of  complex  pattern  recognition  in  the  course  of  solving 
an  MDP. 

Chapter  3  presents  results  about  the  data  efficiency  of  a  class  of  learning 
algorithms (which contains, e.g., linear and logistic regression and the weighted
majority  algorithm)  when  traditional  statistical  assumptions  do  not  hold.  The 
type  of  performance  result  we  prove  in  Chapter  3  is  called  a  worst-case  regret 
bound,  because  it  holds  for  all  sequences  of  training  examples  and  because  it 
bounds  the  regret  of  the  algorithm  or  the  difference  between  its  performance 
and  a  defined  standard  of  comparison.  Since  one  of  the  difficulties  with  learning 
about  Markov  decision  processes  is  that  the  training  samples  are  often  not 
independent  or  identically  distributed,  better  worst-case  bounds  on  learning 
algorithms  are  a  first  step  towards  using  these  algorithms  to  learn  about  MDPs. 
The  contributions  of  Chapter  3  are  providing  a  unified  framework  for  deriving 
worst-case  regret  bounds  and  applying  this  framework  to  prove  regret  bounds 
for  several  well-known  algorithms.  Some  of  these  regret  bounds  were  known 
previously,  while  others  are  new. 

Chapter  4  explores  connections  between  the  problem  of  solving  an  MDP 
and  the  problems  of  convex  optimization  and  statistical  estimation.  It  then 
proposes algorithms motivated by these connections, and describes experiments
with  one  of  these  algorithms.  While  this  new  algorithm  does  not  improve  on 
the  best  existing  algorithms,  the  motivation  behind  it  may  help  with  the  design 
of  other  algorithms.  The  contributions  of  this  chapter  include  bringing  together 
results  about  MDPs,  convex  optimization,  and  statistical  estimation;  analyzing 
the  shortcomings  of  existing  value-function  approximation  algorithms  such  as 
fitted  value  iteration  and  linear  programming;  and  designing  and  experimenting 
with  new  algorithms  for  solving  MDPs. 

In  the  remainder  of  this  introduction  we  will  define  Markov  decision  processes 
and  describe  an  algorithm  for  finding  exact  solutions  to  small  MDPs.  This 
algorithm,  called  value  iteration,  will  be  our  starting  point  for  deriving  the 
results  of  Chapter  2;  and  the  underlying  motivation  for  value  iteration,  namely 
representing  the  value  function  as  the  solution  of  a  set  of  nonlinear  equations 
called  the  Bellman  equations,  will  provide  the  starting  point  for  the  results  of 
Chapter  4. 


1.1  Markov  decision  processes 

A  Markov  decision  process  is  a  representation  of  a  planning  problem.  Figure  1.1 
shows  a  simple  example  of  an  MDP.  This  MDP  has  four  states:  the  agent  starts 
at  the  leftmost  state,  then  has  the  choice  of  proceeding  to  either  of  the  two 
middle  states.  If  it  chooses  the  upper  state  it  is  charged  a  cost  of  1  unit;  if  it 
chooses  the  lower,  it  is  charged  a  cost  of  2  units.  In  either  case  the  agent  must 
then  visit  the  final  state  at  a  cost  of  1  unit,  after  which  the  problem  ends. 

The  MDP  of  Figure  1.1  is  small  and  deterministic.  Other  MDPs  may  be 
much  larger  and  may  have  actions  with  stochastic  outcomes.  For  example, 
later  on  we  will  consider  an  MDP  which  has  more  than  10^^  states.  We  are  also 
interested  in  MDPs  with  infinitely  many  states,  although  we  will  usually  replace 
such  an  MDP  by  a  finite  approximation. 

More formally, a Markov decision process is a tuple $(S, A, \delta, c, \gamma, S_0)$. The set
$S$ is the state space; the set $A$ is the action space. At any time $t$, the environment
is in some state $x_t \in S$. The agent perceives $x_t$, and is allowed to choose an
action $a_t \in A$. (If $|A| = 1$, so that the agent has only one choice on each step,
the model is called a Markov process instead of a Markov decision process.)
More generally, the available actions may depend on $x_t$; if this is the case the
agent’s choice is restricted to some set $A(x_t) \subset A$. The transition function $\delta$
(which may be probabilistic) then acts on $x_t$ and $a_t$ to produce a next state
$x_{t+1}$, and the process repeats. The state $x_{t+1}$ may be either an element of $S$ or
the symbol $\odot$ which signifies that the problem is over; by definition $\delta(\odot, a) = \odot$
for any $a \in A(\odot)$. A sequence of states and actions generated this way is called a
trajectory. $S_0$ is a distribution on $S$ which gives the probability of being in each
state at time 0. The cost function, $c$ (which may be probabilistic, but must have
finite mean and variance), measures how well the agent is doing: at each time
step $t$, the agent incurs a cost $c(x_t, a_t)$. By definition $c(\odot, a) = 0$ for any $a$. The
agent must act to minimize the expected discounted cost

$$E\left(\sum_{t=0}^{\infty} \gamma^t c(x_t, a_t)\right)$$

where $\gamma \in [0, 1]$ is called the discount factor.

Figure 1.1: A simple MDP.

A function $\pi$ which assigns an action to every state is called a policy. Following
the policy $\pi$ means performing action $\pi(i)$ when in state $i$. If we write
$p_{aij}$ for the probability of reaching state $j$ when performing action $a$ from state
$i$, then we can define a matrix $P_\pi$ whose $i,j$th element is $p_{\pi(i)ij}$. $P_\pi$ is called
the transition probability matrix for $\pi$.

A  deterministic  undiscounted  MDP  can  be  regarded  as  a  weighted  directed 
graph,  just  like  the  example  in  Figure  1.1:  each  state  of  the  MDP  corresponds 
to  a  node  in  the  graph,  while  each  state-action  pair  corresponds  to  a  directed 
edge. There is an edge from node $x$ to node $y$ iff there is some action $a$ so that
$\delta(x, a) = y$; the weight of this edge is $c(x, a)$. In Figure 1.1 we have adopted
the convention that an edge coming out of some node that points to nowhere
corresponds to a transition from that node to $\odot$.

We  can  represent  a  graph  like  the  one  in  Figure  1.1  with  an  adjacency  matrix 
and a cost vector. The adjacency matrix $E$ has one row for each edge and one
column for each node. The row for an edge $(i, j)$ has a $-1$ in column $i$ and a
$+1$ in column $j$, and all other elements 0. The cost vector $c$ has one element for
each edge; the element for an edge $(i, \delta(i, a))$ is equal to $c(i, a)$.

The adjacency matrix $E$ is related to the transition probability matrices $P_\pi$:
for any deterministic policy $\pi$, $P_\pi - I$ is a submatrix of $E$. Similarly, $c_\pi$ (defined
to be the vector whose $i$th element is $c(i, \pi(i))$) is a subvector of $c$. In fact, if we
think of a policy as a subset of the edges containing exactly one edge leading
out of each state, then $P_\pi - I$ is the submatrix of $E$ that results from deleting
all rows that correspond to edges not in $\pi$.

We can generalize the idea of an adjacency matrix to stochastic or discounted
MDPs: the idea is that $\gamma P_\pi - I$ should still always be a submatrix of $E$. So, we
define $E$ to be a matrix with one row for every state-action pair in the MDP.
If action $a$ executed in state $i$ has probability $p_{aij}$ of moving the agent to state
$j$, then we define the $j$th entry in the row of $E$ for state $i$ and action $a$ to be
either $\gamma p_{aij}$ (if $i \neq j$) or $\gamma p_{aij} - 1$ (if $i = j$). Similarly, we can generalize the cost
vector by setting $c$ to be the vector whose element in the row corresponding to
state $i$ and action $a$ is $E(c(i, a))$.

Often we will write the adjacency matrix $E$ without the column corresponding
to $\odot$ and without the rows corresponding to transitions out of $\odot$. This causes
no loss of information, since any missing probability mass in a transition may
be assumed to belong to $\odot$, and since the transitions out of $\odot$ are determined
by the definition of an MDP.
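
As a concrete illustration of the construction above, the following sketch builds the adjacency matrix and cost vector for the four-state MDP of Figure 1.1, with the column for the terminal symbol omitted. The state names x, y, z, g are the ones used in Section 1.3; the tuple encoding of edges, the helper name `adjacency_and_costs`, and the zero cost on the edge leaving g (inferred from the stated value of 0 at g) are assumptions made for this example, not part of the thesis.

```python
# States of the Figure 1.1 MDP; the terminal symbol gets no column.
states = ["x", "y", "z", "g"]
TERMINAL = None  # hypothetical marker for the end-of-problem symbol

# (source, destination, cost) for each edge. The zero cost on the edge
# out of g is an assumption inferred from the value function (2, 1, 1, 0).
edges = [
    ("x", "y", 1.0),
    ("x", "z", 2.0),
    ("y", "g", 1.0),
    ("z", "g", 1.0),
    ("g", TERMINAL, 0.0),
]

def adjacency_and_costs(states, edges):
    """Return (E, c): one row of E and one entry of c per edge."""
    col = {s: k for k, s in enumerate(states)}
    E, c = [], []
    for src, dst, cost in edges:
        row = [0.0] * len(states)
        row[col[src]] -= 1.0          # -1 in the column of the source node
        if dst is not TERMINAL:
            row[col[dst]] += 1.0      # +1 in the column of the destination
        E.append(row)
        c.append(cost)
    return E, c

E, c = adjacency_and_costs(states, edges)

# Slack of each edge under the value function v = (2, 1, 1, 0):
# (E v + c) is zero on edges that achieve the minimum cost-to-go.
v = [2.0, 1.0, 1.0, 0.0]
slack = [sum(e * vi for e, vi in zip(row, v)) + ci for row, ci in zip(E, c)]
```

In this sketch only the edge from x to z has positive slack, which matches the observation in Section 1.3 that the edge from x to y is the optimal one.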


1.2  MDP  examples 

In addition to the simple example from Figure 1.1, Markov decision processes
can  represent  much  larger,  more  complicated  planning  problems.  Some  of  the 
MDPs  that  researchers  have  tried  to  solve  are: 

•  Factory  production  planning.  In  this  problem  different  states  correspond 
to different inventory levels of various products or different arrangements
of the production lines, while actions correspond to possible rearrangements
of the production lines. The cost function includes money spent
on  raw  materials,  rent  paid  for  warehouse  space,  and  profits  earned  from 
selling  products. 

•  Control  of  a  robot  arm.  In  this  problem  the  state  encodes  the  position, 
joint  angles,  and  joint  velocities  of  the  arm,  as  well  as  the  locations  of 
obstacles  in  the  workspace.  Actions  specify  joint  torques,  and  the  cost 
function includes bonuses for bringing the arm close to its target configuration
and penalties for collisions or jerky motion.

•  Elevator  scheduling.  The  state  for  this  problem  includes  such  information 
as  the  locations  of  the  elevators  and  whether  each  call  button  has  been 
pressed.  The  actions  are  to  move  the  elevators  from  floor  to  floor  and 
open  and  close  their  doors,  and  the  cost  function  penalizes  the  learner  for 
making  people  wait  too  long  before  being  picked  up  or  let  off. 

•  The  game  of  Tetris.  We  discuss  this  MDP  in  more  detail  in  Chapter  4.  Its 
state  includes  the  current  configuration  of  empty  and  filled  squares  on  the 
board  as  well  as  the  type  of  piece  to  be  placed  next.  The  actions  specify 
where  to  place  the  current  piece,  and  the  reward  for  each  transition  is 
equal  to  the  change  in  the  player’s  score. 

These  MDPs  are  all  too  large  to  solve  exactly;  for  example,  the  version  of  Tetris 
we  describe  in  Chapter  4  has  more  than  10^^  states,  while  the  robot  arm  control 
problem has infinitely many states because it includes real-valued variables such
as  joint  angles. 


1.3  Value  iteration 

If our MDP is sufficiently small, we can find the exact optimal controller by
any of several methods, for example value iteration, policy iteration, or linear
programming (see [Ber95]). These methods are based on computing the so-called
value, evaluation, or cost-to-go function, which is defined by the recursion

$$v(x) = \min_{a \in A} E(c(x, a) + \gamma v(\delta(x, a)))$$

(If $\gamma = 1$ we will need to specify one or more base cases such as $v(\odot) = 0$ to
define a unique value function.) This recursion is called the Bellman equation.
If  we  know  the  value  function,  it  is  easy  to  compute  an  optimal  action  from  any 
state:  any  a  which  achieves  the  minimum  in  the  Bellman  equation  will  do.  For 
example, the value function for the MDP of Figure 1.1 is $(x, y, z, g) = (2, 1, 1, 0)$.
The edge from $x$ to $y$ achieves the minimum in the Bellman equation ($1 + v(y) = 2$), while the
edge from $x$ to $z$ does not ($2 + v(z) = 3$); so, the optimal action from state $x$ is to go to $y$.

Value iteration works by treating the Bellman equation as an assignment.
That is, it picks an arbitrary initial guess $v^{(0)}$ and on the $i$th step it sets

$$v^{(i+1)}(x) = \min_{a \in A} E(c(x, a) + \gamma v^{(i)}(\delta(x, a))) \qquad (1.1)$$

for every $x \in S$. For the special case of deterministic undiscounted MDPs, the
problem of finding an optimal controller is just the single-destination minimum-cost
paths problem, and value iteration is called the Bellman-Ford algorithm.
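
The update of Equation 1.1 can be sketched in a few lines for the deterministic MDP of Figure 1.1. The dictionary encoding, the action names, and the zero cost on the transition out of g are assumptions chosen for this illustration; the base case that the terminal symbol has value 0 is handled by the default in `v.get`.

```python
TERMINAL = "end"  # hypothetical name for the end-of-problem symbol

# transitions[state][action] = (next_state, cost); action names are made up.
transitions = {
    "x": {"up": ("y", 1.0), "down": ("z", 2.0)},
    "y": {"go": ("g", 1.0)},
    "z": {"go": ("g", 1.0)},
    "g": {"stop": (TERMINAL, 0.0)},
}

def backup(v, gamma=1.0):
    """One step of value iteration: apply Equation 1.1 to every state."""
    return {
        x: min(cost + gamma * v.get(nxt, 0.0) for nxt, cost in actions.values())
        for x, actions in transitions.items()
    }

def value_iteration(gamma=1.0, tol=1e-12, max_steps=1000):
    """Iterate the backup from an all-zero guess until it stops changing."""
    v = {x: 0.0 for x in transitions}
    for _ in range(max_steps):
        new_v = backup(v, gamma)
        if max(abs(new_v[x] - v[x]) for x in v) <= tol:
            return new_v
        v = new_v
    return v

def greedy_action(v, x, gamma=1.0):
    """Return an action achieving the minimum in the Bellman equation at x."""
    def q(action):
        nxt, cost = transitions[x][action]
        return cost + gamma * v.get(nxt, 0.0)
    return min(transitions[x], key=q)

v = value_iteration()
best_at_x = greedy_action(v, "x")
```

On this example the iteration settles after two backups, which is what one would expect from the Bellman-Ford view: the longest path to the terminal symbol has three edges, and the last of them costs nothing.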

To  save  writing  one  copy  of  Equation  1.1  for  each  state,  we  define  the  vector 
operator  T  so  that 

$$v^{(i+1)} = T(v^{(i)})$$

In  other  words,  T  performs  one  step  of  value  iteration  on  its  argument,  updating 
the  value  of  every  state  in  parallel  according  to  the  Bellman  equation.  A  step 
of  value  iteration  is  called  a  backup,  and  T  is  called  the  backup  operator. 

A greedy policy for a given value function is one in which, for all $x$, $\pi(x)$
achieves the minimum in the right-hand side of the Bellman equation. Given a
policy $\pi$, define $T_\pi$ so that

$$[T_\pi(v)]_x = E(c(x, \pi(x)) + \gamma [v]_{\delta(x, \pi(x))})$$

where the notation $[v]_x$ stands for component $x$ of the vector $v$. $T_\pi$ is called the
backup operator for $\pi$. If $\pi$ is greedy for $v$, then $Tv = T_\pi v$.

The operator $T_\pi$ is affine, that is, there is a matrix $\gamma P_\pi$ and a vector $c_\pi$ so
that $T_\pi v = \gamma P_\pi v + c_\pi$. In fact, $P_\pi$ is the transition probability matrix for $\pi$, and
$c_\pi$ is the cost vector for $\pi$. That is, the elements of $c_\pi$ are the costs $c(x, \pi(x))$
for each state $x$, while the row of $P_\pi$ which corresponds to state $x$ contains the
probability distribution for the state $x_{t+1}$ given that $x_t = x$ and that we take
action $\pi(x)$.
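
To make the affine form of the policy backup concrete, here is a minimal sketch on a two-state, two-action MDP. The transition probabilities, expected costs, discount factor, and all function names are invented for the illustration. The code assembles the transition matrix and cost vector for a fixed policy, applies the backup, and checks numerically that the backup commutes with affine combinations of value functions.

```python
GAMMA = 0.9  # assumed discount factor for this example

# Hypothetical MDP: P[a][i][j] is the probability of moving from i to j
# under action a; EXPECTED_COST[a][i] is the expected cost of a in state i.
P = {
    "stay": [[0.8, 0.2], [0.1, 0.9]],
    "move": [[0.3, 0.7], [0.6, 0.4]],
}
EXPECTED_COST = {"stay": [1.0, 2.0], "move": [1.5, 0.5]}

def policy_matrices(policy):
    """Pick out, for each state i, the row and cost of action policy[i]."""
    n = len(policy)
    P_pi = [P[policy[i]][i] for i in range(n)]
    c_pi = [EXPECTED_COST[policy[i]][i] for i in range(n)]
    return P_pi, c_pi

def policy_backup(v, policy):
    """Compute gamma * P_pi v + c_pi, the backup for a fixed policy."""
    P_pi, c_pi = policy_matrices(policy)
    return [
        c_pi[i] + GAMMA * sum(p * vj for p, vj in zip(P_pi[i], v))
        for i in range(len(v))
    ]

def mix(alpha, u, w):
    """Affine combination alpha * u + (1 - alpha) * w."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(u, w)]

policy = ["move", "stay"]
u, w, alpha = [1.0, -2.0], [4.0, 0.5], 0.25
lhs = policy_backup(mix(alpha, u, w), policy)
rhs = mix(alpha, policy_backup(u, policy), policy_backup(w, policy))
```

The two sides agree up to rounding, as they should for an affine map. The full backup operator, which minimizes over actions, does not have this property in general.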

If $\gamma < 1$, the operator $T$ is a contraction in max norm. That is, if $u$ and $v$
are estimates of the value function, then $\|Tu - Tv\|_\infty \le \gamma \|u - v\|_\infty$. If $\gamma = 1$,
then  under  mild  conditions  T  is  a  contraction  in  some  weighted  max  norm.  In 
either  case,  by  the  contraction  mapping  theorem  (see  [BT89]),  value  iteration 
converges  to  the  unique  solution  of  the  Bellman  equations. 
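
The contraction property for a discounted problem can be checked numerically on a small example. The sketch below uses a made-up two-state, two-action MDP with an assumed discount factor of 0.9, draws random pairs of value estimates, and records the largest observed ratio between the max-norm distance after one backup and the distance before it. This is only a sanity check on one hypothetical MDP, not a proof.

```python
import random

GAMMA = 0.9  # assumed discount factor, strictly less than 1

# Hypothetical MDP: P[a][i][j] transition probabilities, COST[a][i] expected costs.
P = {
    "stay": [[0.8, 0.2], [0.1, 0.9]],
    "move": [[0.3, 0.7], [0.6, 0.4]],
}
COST = {"stay": [1.0, 2.0], "move": [1.5, 0.5]}

def backup(v):
    """Apply the backup operator T: minimize over actions in every state."""
    n = len(v)
    return [
        min(COST[a][i] + GAMMA * sum(P[a][i][j] * v[j] for j in range(n)) for a in P)
        for i in range(n)
    ]

def max_norm_distance(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

rng = random.Random(0)  # fixed seed so the check is repeatable
worst_ratio = 0.0
for _ in range(1000):
    u = [rng.uniform(-10.0, 10.0) for _ in range(2)]
    v = [rng.uniform(-10.0, 10.0) for _ in range(2)]
    ratio = max_norm_distance(backup(u), backup(v)) / max_norm_distance(u, v)
    worst_ratio = max(worst_ratio, ratio)
```

Every observed ratio stays at or below the discount factor, so repeated backups shrink the distance between any two estimates geometrically; this is the mechanism behind the convergence of value iteration.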

Chapter 2

FITTED VALUE ITERATION

Up to this point, we have described how to find the exact solution to a
Markov  decision  process.  Unfortunately,  we  can  only  find  exact  solutions  for 
small  MDPs.  For  larger  MDPs,  we  must  resort  to  approximate  solutions. 

Any  approximate  solution  must  take  advantage  of  some  prior  knowledge 
about  the  MDP:  in  the  worst  case,  when  we  don’t  know  anything  about  which 
states are similar to which others, we have no hope of even being able to represent
a good approximate solution. Luckily, if we have to solve a large MDP in
practice, we usually know something about where it came from. For example,
an MDP with 10^{10} + 1 states is probably too large to solve exactly with current
computers; but if we know that these states are the dollar amounts in one-cent
increments between zero and a hundred million, we can take advantage of the
fact that a good action from the state $1053.76 is probably also a good action
from the state $1053.77. Similarly, we usually can’t solve an MDP with infinitely
many states exactly, but if we know the states are the positions between
0 and 1m we can take advantage of the fact that a motion of 1nm is unlikely to
matter very much.

The  simplest  and  oldest  method  for  finding  approximate  value  functions  is 
to  divide  the  states  of  the  MDP  into  groups,  pick  a  representative  state  from 
each  group,  and  pretend  that  the  states  in  each  group  all  have  the  same  value 
as  their  representative.  For  example,  in  the  MDP  with  states  between  0  and 
1m, one group could be the states from 0 to 1cm with representative 0.5cm, the
next  could  be  the  states  from  1cm  to  2cm  with  representative  1.5cm,  and  so 
forth,  for  a  total  of  100  groups.  If  a  1cm  resolution  turned  out  to  be  too  coarse 
in  some  interval,  say  between  33cm  and  34cm,  we  could  replace  that  group  with 
a  larger  number  of  finer  divisions,  say  330mm  to  331mm,  331mm  to  332mm, 
and  so  forth,  giving  a  total  of  109  groups. 

Once  we  have  divided  the  states  into  groups  we  can  run  value  iteration  just 
as  before.  If  we  see  a  transition  that  ends  in  a  non-representative  state,  say 
one  that  takes  us  to  the  state  1.6cm,  we  look  up  the  value  of  the  appropriate 
representative,  in  this  case  1.5cm.  This  way  we  only  have  to  store  and  update 
the  values  for  the  representative  states,  which  means  that  we  only  have  to  pay 
attention  to  transitions  that  start  in  the  representative  states.  So,  value  iteration 
will  run  much  faster  than  if  we  had  to  examine  all  of  the  values  and  all  of  the 
transitions. 
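
The grouping scheme just described can be sketched for the one-dimensional example with positions between 0 and 1m. Everything about the dynamics here is an assumption made up for the illustration: positions are whole millimetres, the two actions move 7mm left or right, each step costs 1, the problem ends when the agent reaches position 0, and the discount factor is 0.95. Only the 100 representatives are stored and backed up.

```python
import math

GROUP_MM = 10     # each group is 1cm wide
N_GROUPS = 100    # groups cover 0mm to 1000mm
STEP_MM = 7       # hypothetical action size
GAMMA = 0.95      # hypothetical discount factor

def group_of(pos_mm):
    """Index of the group that contains a position."""
    return min(pos_mm // GROUP_MM, N_GROUPS - 1)

def representative(group):
    """Midpoint of a group, e.g. 15mm (1.5cm) for the group from 1cm to 2cm."""
    return group * GROUP_MM + GROUP_MM // 2

def step(pos_mm, action):
    """Return (next position, cost); next position is None when the problem ends."""
    nxt = pos_mm + action * STEP_MM
    if nxt <= 0:
        return None, 1.0
    return min(nxt, 1000), 1.0

def aggregated_value_iteration(sweeps=500):
    """Value iteration that stores one value per group."""
    v = [0.0] * N_GROUPS
    for _ in range(sweeps):
        new_v = []
        for g in range(N_GROUPS):
            best = math.inf
            for action in (-1, +1):
                nxt, cost = step(representative(g), action)
                # A successor that is not a representative borrows the value
                # of the representative of its group.
                future = 0.0 if nxt is None else v[group_of(nxt)]
                best = min(best, cost + GAMMA * future)
            new_v.append(best)
        v = new_v
    return v

v = aggregated_value_iteration()
value_at_16mm = v[group_of(16)]  # looked up through the representative at 15mm
```

Each sweep touches 100 representatives and 2 actions, regardless of how finely the underlying positions are resolved.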

This method for finding approximate value functions is called state aggregation.
It can work well for moderate-sized MDPs, but it suffers from a problem:
if we choose to divide each axis of a $d$-dimensional continuous state space into
$k$ partitions, we will wind up with $k^d$ states in our discretization. Even if $k$ and
$d$ are both relatively small we can wind up with a huge number of states. For
example, if we divide each of six continuous variables into a hundred partitions
each, the result is $10^{12}$ distinct states. This problem is called the curse of
dimensionality, since the number of states in the discretization is exponential in
$d$.
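
The growth in the size of the discretization is easy to tabulate. The helper name below is hypothetical; the computation is just the product of the number of partitions over all axes.

```python
def discretization_size(k, d):
    """Number of cells when each of d axes is divided into k partitions."""
    return k ** d

# Sizes for a hundred partitions per axis and one to six axes.
sizes = {d: discretization_size(100, d) for d in range(1, 7)}
```

With a hundred partitions per axis, each additional state variable multiplies the number of cells by a hundred, so six variables already give a million million cells.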

To  avoid  the  curse  of  dimensionality,  we  would  like  to  have  an  algorithm 
that  works  with  more  flexible  representations  than  just  state  aggregation.  For 
example,  rather  than  setting  a  state’s  value  to  that  of  a  single  representative, 
we might prefer to interpolate linearly between a pair of neighboring representatives;
or, in higher dimensions, we might want to set a state’s value to the
average  of  the  k  nearest  representatives.  This  kind  of  flexibility  can  let  us  get 
away  with  fewer  representatives  and  so  solve  larger  problems. 

One  algorithm  that  can  take  advantage  of  such  representations  is  fitted  value 
iteration, which is the subject of this chapter. Fitted value iteration generalizes
state aggregation to handle representations like linear interpolation and
$k$-nearest-neighbor.

In  fitted  value  iteration,  we  interleave  steps  of  value  iteration  with  steps  of 
function approximation. It will turn out that, if the function approximator satisfies
certain conditions, we will be able to prove convergence and error bounds
for  fitted  value  iteration.  If,  in  addition,  the  function  approximator  is  linear  in 
its  parameters,  we  will  be  able  to  show  that  fitted  value  iteration  on  the  original 
MDP  is  equivalent  to  exact  value  iteration  on  a  smaller  MDP  embedded  within 
the  original  one. 

The conditions on the function approximator allow such widely-used methods
as $k$-nearest-neighbor, local weighted averaging, and linear and multilinear
interpolation;  however,  they  rule  out  all  but  special  cases  of  linear  regression, 
local  weighted  regression,  and  neural  net  fitting.  In  later  chapters  we  will  talk 
about  ways  to  use  more  general  function  approximators. 

Most  of  the  material  in  this  chapter  is  drawn  from  [Gor95a]  and  [Gor95b]. 
Some  of  this  material  was  discovered  simultaneously  and  independently  in  [TV94]. 
A  related  algorithm  which  learns  online  (that  is,  by  following  trajectories  in  the 
MDP  and  updating  states  only  as  they  are  visited,  in  contrast  to  the  way  fitted 
value  iteration  can  update  states  in  any  order)  is  described  in  [SJJ95]. 


2.1  Discounted  processes 

In  this  section,  we  will  consider  only  discounted  Markov  decision  processes. 
Section 2.2 generalizes the results to nondiscounted processes.

Suppose that $T_M$ is the parallel value backup operator for a Markov decision
process $M$, as defined in Chapter 1. In the basic value iteration algorithm, we
start off by setting $v_0$ to some initial guess at $M$’s value function. Then we
repeatedly set $v_{i+1}$ to be $T_M(v_i)$ until we either run out of time or decide that
some $v_n$ is a sufficiently accurate approximation to $M$’s true value function $v^*$.
Normally we would represent each $v_i$ as an array of real numbers indexed by
the  states  of  M;  this  data  structure  allows  us  to  represent  any  possible  value 
function  exactly. 

Now suppose that we wish to represent $v_i$, not by a lookup table, but by
some other more compact data structure such as a piecewise linear function.
We immediately run into two difficulties. First, computing $T_M(v_i)$ generally
requires that we examine $v_i(x)$ for nearly every $x$ in $M$’s state space; and if
$M$ has enough states that we can’t afford a lookup table, we probably can’t
afford to compute $v_i$ that many times either. Second, even if we can represent
$v_i$ exactly, there is no guarantee that we can also represent $T_M(v_i)$.
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To  address  these  difficulties,  we  will  assume  that  we  have  a  sample  Xq  C  S  of 
states  from  M’s  state  space  5.  Xq  should  be  small  enough  that  we  can  examine 
each  element  repeatedly;  but  it  should  be  representative  enough  that  we  can 
learn  something  about  M  by  examining  only  states  in  Xq.  Now  we  can  define 
the  fitted  value  iteration  algorithm.  Rather  than  setting  Ui+i  to  TMivi),  we  will 
first  compute  only  for  x  G  Xq]  then  we  will  fit  our  piecewise  linear 

function  (or  other  approximator)  to  these  training  values  and  call  the  resulting 
function  Vi-^i . 
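The loop just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the dictionary encoding of the MDP, the names `backup`, `fitted_value_iteration`, and `nearest_neighbor_fit`, and the five-state chain are all hypothetical, and the backup is assumed to minimize expected cost.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical encoding: mdp[x][action] = (cost, [(next_state, probability), ...])
MDP = Dict[int, Dict[str, Tuple[float, List[Tuple[int, float]]]]]
ValueFn = Callable[[int], float]

def backup(mdp: MDP, v: ValueFn, x: int, gamma: float) -> float:
    """(T_M v)(x): cheapest action's cost plus discounted expected next value."""
    return min(cost + gamma * sum(p * v(y) for y, p in successors)
               for cost, successors in mdp[x].values())

def fitted_value_iteration(mdp, sample, fit, v0, gamma, iterations):
    """Alternate backups at the sample states with a fit to those targets."""
    v = v0
    for _ in range(iterations):
        targets = {x: backup(mdp, v, x, gamma) for x in sample}
        v = fit(targets)
    return v

def nearest_neighbor_fit(targets: Dict[int, float]) -> ValueFn:
    """1-nearest-neighbor approximator; ties go to the lower-numbered state."""
    def v(x: int) -> float:
        nearest = min(targets, key=lambda s: (abs(s - x), s))
        return targets[nearest]
    return v

# Five-state chain: state 0 is an absorbing zero-cost goal; every other state
# pays cost 1 to step one state to the left.
chain: MDP = {0: {"stay": (0.0, [(0, 1.0)])}}
for x in range(1, 5):
    chain[x] = {"left": (1.0, [(x - 1, 1.0)])}

v = fitted_value_iteration(chain, [0, 2, 4], nearest_neighbor_fit,
                           lambda x: 0.0, gamma=0.9, iterations=50)
print([round(v(x), 3) for x in range(5)])
```

Only the three sampled states are ever backed up; the values at states 1 and 3 come entirely from the approximator, which is why the result differs from the exact value function of the chain.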


2.1.1  Approximators as mappings

In order to reason about fitted value iteration, we will consider function
approximators themselves as operators on the space of value functions. By a function
approximator we mean a deterministic algorithm A which takes as input the
target values for the states in X_0 and produces as output an intermediate
representation which allows us to compute the fitted value at any state x ∈ S. In
this definition the states in the sample X_0 are fixed, so changing X_0 results in
a different function approximator.

In order to think of the algorithm A as an operator on value functions, we
must reinterpret A’s input and output as functions from S to ℝ. By doing so,
we will define a mapping associated with A, M_A : (S → ℝ) → (S → ℝ); the
input to M_A will be a function f that represents the training values for A, while
the output of M_A will be another function f̂ = M_A(f) that represents the fitted
values produced by A.

If there are m states in the sample X_0, then the input to A is a vector of m
real numbers. Equivalently, the input is a function f from X_0 to ℝ: the target
value for state x is f(x). Since the sample X_0 is a subset of the state space S,
we can extend f to take arguments in all of S by defining f(y) arbitrarily for
y ∉ X_0. This extended f is what M_A will take as input.

With this definition for f, it is easy to see how to define f̂: for any x ∈ S,
f̂(x) is just the fitted value at state x given the training values encoded in f.
So, M_A will take the training values at states x ∈ X_0 as input (encoded as a
function f : S → ℝ as described in the previous paragraph), and produce the
approximate value function f̂ as output.

In the above definition, it is important to distinguish the target function f
and the learned function f̂ from the mapping M_A: the former are real-valued
functions, while the latter is a function from functions to functions. It is also
important to remember that M_A is a deterministic function: since X_0 is fixed
and f is deterministic, there is no element of randomness in selecting A’s training
data. Finally, although M_A appears to require a large amount of information as
input and produce a large amount of information as output, this appearance is
misleading: M_A ignores most of the information in its input, since M_A(f) does
not depend on f(x) for x ∉ X_0, and most of the information in f̂ = M_A(f)
is redundant, since by assumption f̂(x) can be computed for any x from the
output of algorithm A.

Figure 2.1: An example of the mapping for a function approximator.

Figure 2.1 shows the mapping associated with a simple function approximator.
In this figure the MDP has only two states, x and y, both of which are
in the sample X_0. The function approximator has a single adjustable parameter
(call it a) and represents v(x) = v(y) = a. The algorithm for finding the
parameter a from the training data v(x) and v(y) is a ← (v(x) + v(y))/2.

The set of possible value functions for a two-state MDP is equivalent to ℝ²,
so each point plotted in Figure 2.1 corresponds to a different possible value
function. For example, the point (1,5) corresponds to the value function that
has v(x) = 1 and v(y) = 5. The set of functions that the approximator can
represent is shown as a thick line; these are the functions with v(x) = v(y). The
operator M_A maps an input (target) value function in ℝ² to an output (fitted)
value function on the line v(x) = v(y).

Figure 2.2 illustrates the mapping associated with a slightly more complicated
function approximator. In this figure the state space of the MDP is an
interval of the real line, and the sample is X_0 = {1, 2, 3, 4, 5}. The function
approximator has two adjustable parameters (call them a and b) and represents
the value of a state with coordinate x as v(x) = ax + b. The algorithm for
finding a and b from the training data is linear regression.

The left column of Figure 2.2 shows two possible inputs to M_A, while the
right column shows the corresponding outputs. Both the inputs and the outputs
are functions from the entire state space to ℝ, but the input functions are plotted
only at the sample points to emphasize that M_A does not depend on their value
at any states x ∉ X_0.

Figure 2.2: Another example of the mapping for a function approximator.

With our definition of M_A, we can write the fitted value iteration algorithm
as follows. Given an initial estimate v_0 of the value function, we begin by
computing M_A(v_0), an approximate representation of v_0. Then we alternately
apply T_M and M_A to produce the series of functions

v_0, M_A(v_0), T_M(M_A(v_0)), M_A(T_M(M_A(v_0))), ...

(In an actual implementation, only the functions M_A(...) would be represented
explicitly; the functions T_M(...) would just be sampled at the points X_0.)
Finally, when we satisfy some termination condition, we return one of the functions
M_A(...).

The characteristics of the mapping M_A determine how it behaves when combined
with value iteration. Figure 2.3 illustrates one particularly important
property. As the figure shows, linear regression can exaggerate the difference
between two target functions: a small difference between the target functions
f and g can lead to a larger difference between the fitted functions f̂ and ĝ.
For example, while the two input functions in the left column of the figure differ
by at most 1 at any state, the corresponding output functions in the right
column differ by 7/6 at x = 3. Many function approximators, such as neural
nets and local weighted regression, can exaggerate this way; others, such as
k-nearest-neighbor, cannot.
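The exaggeration effect is easy to reproduce numerically. The sketch below fits a least-squares line to two sets of targets on the sample {1, 2, 3} and compares max-norm distances before and after fitting. The target values `f` and `g` are hypothetical choices made for illustration (they are not read off Figure 2.3), and `linear_regression_fit` is an illustrative helper name.

```python
from fractions import Fraction

def linear_regression_fit(xs, ys):
    """Least-squares line through (xs, ys), returned as a function of x."""
    xs = [Fraction(x) for x in xs]
    ys = [Fraction(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

sample = [1, 2, 3]
f = [0, 0, 0]   # hypothetical target values at the sample states
g = [0, 1, 1]   # differs from f by at most 1 at any sample state

f_hat = linear_regression_fit(sample, f)
g_hat = linear_regression_fit(sample, g)

# Max-norm distance between the targets, and between the fitted lines.
target_gap = max(abs(a - b) for a, b in zip(f, g))
fitted_gap = max(abs(f_hat(x) - g_hat(x)) for x in sample)
print(target_gap, fitted_gap)
```

Exact rational arithmetic is used so that the comparison between the two gaps is not blurred by floating-point rounding; for these targets the fitted lines end up farther apart than the targets were.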

This sort of exaggeration can cause instability in a fitted value iteration
algorithm. By contrast, we will show below that approximators which never
exaggerate can always be combined safely with value iteration.

Figure 2.3: Linear regression on the sample X_0 = {1,2,3}.

More precisely, we will say that an approximator exaggerates the difference
between two target functions f and g if the fitted functions f̂ = M_A(f) and
ĝ = M_A(g) are farther apart in max norm than f and g were. Then the
approximators which never exaggerate are exactly the ones whose mappings are
nonexpansions in max norm: by definition, if M_A is a nonexpansion in max
norm, then for any target functions f and g and for any x we have

\[ |\hat f(x) - \hat g(x)| \le \max_y |f(y) - g(y)| \]

Note that we do not require that f(x) and f̂(x) be particularly close to each
other, nor that f̂(x) and f̂(y) be as close to each other as f(x) and f(y).

The  above  discussion  is  summarized  in  the  following  theorem: 

Theorem 2.1  Let T_M be the parallel value backup operator for some Markov
decision process M with discount γ < 1. Let A be a function approximator with
mapping M_A. Suppose M_A is a nonexpansion in max norm. Then M_A ∘ T_M has
contraction factor γ; so the fitted value iteration algorithm based on A converges
in max norm at the rate γ when applied to M.

Proof: We saw above that T_M is a contraction in max norm with factor γ.
By assumption, M_A is a nonexpansion in max norm. Therefore M_A ∘ T_M is a
contraction in max norm by the factor γ. □
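Written out, the argument is a two-step chain of inequalities. The symbols u and v (arbitrary value functions), u_n (the nth iterate), and \bar v (the fixed point of M_A ∘ T_M) are names introduced here for illustration:

```latex
\[
\| M_A(T_M u) - M_A(T_M v) \|
  \;\le\; \| T_M u - T_M v \|
  \;\le\; \gamma \, \| u - v \| ,
\]
% first step: M_A is a nonexpansion; second step: T_M contracts by \gamma.
% Applying the bound repeatedly to u_{n+1} = M_A(T_M u_n) gives
\[
\| u_n - \bar v \| \;\le\; \gamma^{n} \, \| u_0 - \bar v \| .
\]
```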

One might wonder whether the converse of Theorem 2.1 is true, that is,
whether the convergence of fitted value iteration with approximator A for all
MDPs implies that M_A is a max-norm nonexpansion. We do not know the
answer to this question, but if we add weak additional conditions on A we can
prove a theorem. See Section 2.7.1.

2.1.2  Averagers 

Theorem 2.1 raises the question of which function approximators can exaggerate
and which cannot. Unfortunately, many common approximators can. For
example, as Figure 2.3 demonstrates, linear regression can be an expansion in max
norm; and Boyan and Moore [BM95] show that fitted value iteration with
linear regression can diverge. Other methods which may diverge include standard
feedforward neural nets and local weighted regression [BM95].

On the other hand, many approximation methods are nonexpansions,
including local weighted averaging, k-nearest-neighbor, Bezier patches, linear
interpolation on a mesh of simplices, and multilinear interpolation on a mesh of
hypercubes, as well as simpler methods like grids and other state aggregation.
In fact, in addition to being nonexpansions in max norm, these methods all have
two other important properties. (Not all nonexpansive function approximators
have these additional properties, but many important ones do.)

First, all of the function approximation methods listed in the previous
paragraph are linear in the sense that their mappings are linear functions. Linearity
of the approximator in this sense does not mean that the fitted function f̂ must
be linear; instead, it means that for each x, f̂(x) must be a linear function of
f(x_1), f(x_2), ... for some x_1, x_2, ... ∈ X_0.

Second, all of these function approximation methods are monotone in the
sense that their mappings are monotone functions. Again, there is no need
for the fitted function f̂ to be monotone; instead, this kind of monotonicity
means that increasing any of the training values cannot decrease any of the
fitted values.

We will call any function approximator that satisfies these three properties
(linearity, monotonicity, and nonexpansivity) an averager. The reason for
this name is that averagers are exactly the function approximators in which
every fitted value f̂(x) is the weighted average of one or more target values
f(x_1), f(x_2), ..., plus a constant offset. (The weights and offsets must be fixed,
that is, they cannot depend on the target values. They can, however, depend
on the choice of sample X_0, as they do in for example k-nearest-neighbor.)
Averagers were first defined in [Gor95a]; the definition there is slightly less general
than the one given here, but the theorems given there still hold for the more
general definition. A similar class of function approximators (called interpolative
representations) was defined simultaneously and independently in [TV94].

More precisely, if M has n states, then specifying an averager is equivalent
to picking n real numbers k_i and n² nonnegative real numbers β_ij such that for
each i we have Σ_j β_ij ≤ 1. With these numbers, the fitted value at the ith
state is defined to be

\[ k_i + \sum_{j=1}^{n} \beta_{ij} f_j \]

where f_j is the target value at the jth state. The correspondence between
averagers and the coefficients β_ij and k_i is one-to-one because, first, any linear
operator M_A is specified by a unique matrix (β_ij) and vector (k_i); second, if any
β_ij is negative then M_A is not monotone; and third, if Σ_j β_ij > 1 for any i,
then increasing the target value by 1 in every state will cause the fitted value
at state i to increase by more than 1.
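The coefficient conditions translate directly into a check on a matrix and an offset vector. The following sketch assumes the averager is stored as a list-of-lists `beta` and a list `k`; the function names are hypothetical. It tests the two-state mean approximator of Figure 2.1 and, for contrast, the weights that linear regression on {1, 2, 3} assigns to its targets, which include negative entries.

```python
def is_averager(beta, tol=1e-12):
    """True if every weight is nonnegative and every row sums to at most 1."""
    nonnegative = all(b >= 0 for row in beta for b in row)
    rows_bounded = all(sum(row) <= 1 + tol for row in beta)
    return nonnegative and rows_bounded

def apply_averager(beta, k, f):
    """Fitted value at state i: k_i + sum_j beta_ij * f_j."""
    return [k_i + sum(b * f_j for b, f_j in zip(row, f))
            for row, k_i in zip(beta, k)]

def max_norm_gap(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

# The approximator of Figure 2.1: both states get the mean of the two targets.
mean_beta = [[0.5, 0.5], [0.5, 0.5]]
mean_k = [0.0, 0.0]
fitted = apply_averager(mean_beta, mean_k, [1.0, 5.0])

# Weights of least-squares linear regression on the sample {1, 2, 3}:
# fitted value at x_i is sum_j (1/3 + (x_i - 2)(x_j - 2)/2) * f_j.
regression_beta = [[1 / 3 + (xi - 2) * (xj - 2) / 2 for xj in (1, 2, 3)]
                   for xi in (1, 2, 3)]

print(fitted, is_averager(mean_beta), is_averager(regression_beta))
```

Passing `is_averager` is sufficient for the max-norm nonexpansion property, since each fitted value then changes by at most a sub-convex combination of the changes in the targets.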

Most of the β_ij's will generally be zero. In particular, β_ij will be zero if
j ∉ X_0. In addition, β_ij will often be zero or near zero if states i and j are far
apart.

Figure 2.4: A nondiscounted deterministic Markov process and an averager.
The process is shown in (a); the goal is state 1, and all arc costs are 1 except
at the goal. In (b) we see the averager, represented as a Markov process: states
1 and 3 are unchanged, while v(2) is replaced by v(3). The embedded Markov
process is shown in (c); state 3 has been disconnected, so its value estimate will
diverge.

To illustrate the relationship between an averager and its coefficients we
can look at a simple example. Consider a Markov decision process with five
states, labelled 1 through 5. Suppose that the sample X_0 is {1, 5}, and that our
averager approximates the values of states 2 through 4 by linear interpolation.
Then the coefficients of this averager are k_i = 0 and

\[
(\beta_{ij}) =
\begin{pmatrix}
1   & 0 & 0 & 0 & 0   \\
3/4 & 0 & 0 & 0 & 1/4 \\
1/2 & 0 & 0 & 0 & 1/2 \\
1/4 & 0 & 0 & 0 & 3/4 \\
0   & 0 & 0 & 0 & 1
\end{pmatrix}
\]

The second row of this array, for example, tells us that the fitted value for state
2 is equal to three-fourths of the target value for state 1, plus one-fourth of the
target value for state 5. The fact that the middle three columns of the β matrix
are zero means that states 2 through 4 are not in the sample X_0.

In this example the coefficients β_{1,1} and β_{5,5} are both equal to 1, which
means that the fitted values at states 1 and 5 are equal to their target values.
This property is not true of all averagers; for example, in k-nearest-neighbor
with k > 1, the fitted value at a state in the sample is not equal to its target
value but to the average of k different target values.
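The coefficient matrix for this five-state example can be generated mechanically. The sketch below assumes states numbered 1 through n with only the two endpoints sampled; `interpolation_matrix` and the target values 0 and 8 are hypothetical names and numbers chosen for illustration.

```python
from fractions import Fraction

def interpolation_matrix(n):
    """Coefficients beta_ij for linear interpolation between states 1 and n."""
    beta = []
    for i in range(1, n + 1):
        w = Fraction(i - 1, n - 1)      # fraction of the way from state 1 to state n
        row = [Fraction(0)] * n
        row[0] += 1 - w                 # weight on the target at state 1
        row[n - 1] += w                 # weight on the target at state n
        beta.append(row)
    return beta

beta = interpolation_matrix(5)
for row in beta:
    print([str(b) for b in row])

# Fitted values for hypothetical targets f_1 = 0 and f_5 = 8; the targets at
# the unsampled states 2..4 are arbitrary because their columns are zero.
f = [0, 99, 99, 99, 8]
fitted = [sum(b * f_j for b, f_j in zip(row, f)) for row in beta]
print([str(v) for v in fitted])
```

The arbitrary value 99 at the unsampled states has no effect on the result, which matches the remark that the middle three columns of the matrix are zero.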


2.2  Nondiscounted  processes 

If γ = 1 in our MDP M, Theorem 2.1 no longer applies: T_M ∘ M_A is merely a
nonexpansion in max norm, and so is no longer guaranteed to converge.
Fortunately, there are averagers which we may use with nondiscounted MDPs. The
proof relies on an intriguing relationship between averagers and fitted value
iteration: we can view any averager as a Markov process, and we can view fitted
value iteration as a way of combining two Markov decision processes.


Figure 2.5: Constructing the embedded Markov process. (a) A deterministic
process: the state space is the unit triangle, and on every step the agent moves a
constant distance towards the origin. The value of each state is its distance from
the origin, so v* is nonlinear. (b) A representative transition from the embedded
process. For our averager, we used linear interpolation on the corners of the
triangle; as before, the agent moves towards the goal, but then the averager
moves it randomly to one of the corners. On average, this scattering moves the
agent back away from the goal, so steps in the embedded process don’t get the
agent as far. The value function for the embedded process is x + y. (c) The
expected progress the agent makes on each step.


The Markov process associated with an averager has state space S, transition
matrix (β_ij), and cost vector (k_i). In other words, the state space is the same
as M’s, the probability of transitioning to state j given that the current state is
i is β_ij, and the cost of leaving state i is k_i. (If Σ_j β_ij < 1, we make
up the difference with a transition to the terminal state ©.) Since the transition
matrix is (β_ij), there is a nonzero probability of going from i to j if and only if
the fitted value at state i depends on the target value at state j. (Presumably
this happens when the averager considers states i and j somehow similar.)

The reason this process is associated with the averager is that its backup
operator is M_A. To see why, consider the backed up value at some state i
given the starting value function v. It is equal to the cost of leaving state i,
which is k_i, plus the expected value of the next state, which is Σ_j β_ij v(j);
in other words, it is equal to the ith component of M_A v. Figure 2.4(b) shows
one example of a simple averager viewed as a Markov process; this averager has
β_11 = β_23 = β_33 = 1 and all other coefficients zero.

The simplest way to combine M with the process for the averager is to
interleave their transitions, that is, to use the next-state function from M on
odd time steps and the next-state function from the averager on even time steps.
The result is an MDP whose next-state function depends on time. To avoid this
dependence we can combine adjacent pairs of time steps, leaving an MDP whose
next-state function is essentially the composition of the original two next-state
functions. (We need to be careful about defining the actions of the combined
MDP: in general a combined action needs to specify an action for the first step
and an entire policy for the second step. In our case, though, the second step
is the Markov process for the averager, which only has one possible action. So,
the actions for the combined MDP are the same as the actions for M.) It is not
too hard to see that the backup operator for this new MDP is T_M ∘ M_A, which
is the same as a single step of the fitted value iteration algorithm.

As mentioned above, the state space for the new MDP is the same as the
state space for M. However, since β_ij is zero if state j is not in the sample X_0,
there will be zero probability of visiting any state outside the sample after the
first time step. So, we can ignore the states in S \ X_0. In other words, the new
MDP is embedded inside the old.

The embedded MDP is the same as the original one except that after every
step the agent gets randomly scattered (with probabilities depending on the
averager) from its current state to some nearby state in X_0. So, if a transition
leads from x to y in the original MDP, and if the averager considers state
z ∈ X_0 similar to y, then the same transition in the embedded MDP has a
chance of moving the agent from x to z. Figure 2.4 shows a simple example of
the embedded MDP; a slightly more complicated example is in Figure 2.5. As
the following theorem shows (see Section 2.7.2 for a proof), exact value iteration
on the embedded MDP is the same as fitted value iteration on the original MDP.
A similar theorem holds for the Q-learning algorithm; see Section 2.7.4.

Theorem 2.2 (Embedded MDP)  For any averager A with mapping M_A,
and for any MDP M (either discounted or nondiscounted) with parallel value
backup operator T_M, the function T_M ∘ M_A is the parallel value backup operator
for a new Markov decision process M'.
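For a finite MDP the embedded process can be built explicitly and the claim checked numerically. The sketch below is an illustration under assumptions, not the proof from Section 2.7.2: it assumes the backup minimizes cost, stores one cost vector and one transition matrix per action, and uses made-up numbers for a three-state example. Composing each transition matrix P_a with (β_ij) gives the embedded transitions, and the averager's offsets k are folded into the action costs.

```python
def apply_averager(beta, k, v):
    return [k_i + sum(b * v_j for b, v_j in zip(row, v))
            for row, k_i in zip(beta, k)]

def mat_vec(P, v):
    return [sum(p * v_j for p, v_j in zip(row, v)) for row in P]

def mat_mul(P, Q):
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def backup(costs, trans, gamma, v):
    """(T v)(x) = min_a [ costs[a][x] + gamma * sum_y trans[a][x][y] * v[y] ]."""
    q_values = []
    for c, P in zip(costs, trans):
        q_values.append([c_x + gamma * pv_x for c_x, pv_x in zip(c, mat_vec(P, v))])
    return [min(column) for column in zip(*q_values)]

def embed(costs, trans, gamma, beta, k):
    """Embedded MDP: transitions P_a * beta, costs c_a + gamma * P_a * k."""
    new_trans = [mat_mul(P, beta) for P in trans]
    new_costs = [[c_x + gamma * pk_x for c_x, pk_x in zip(c, mat_vec(P, k))]
                 for c, P in zip(costs, trans)]
    return new_costs, new_trans

# Hypothetical three-state MDP with two actions (all numbers are made up).
gamma = 0.9
costs = [[0.0, 1.0, 1.0], [0.0, 2.0, 0.5]]
trans = [[[1, 0, 0], [1, 0, 0], [0, 1, 0]],
         [[1, 0, 0], [0.5, 0.5, 0], [0.5, 0, 0.5]]]
# Averager: states 0 and 2 are sampled; state 1 is interpolated, plus an offset.
beta = [[1, 0, 0], [0.5, 0, 0.5], [0, 0, 1]]
k = [0.0, 0.1, 0.0]

v = [0.0, 3.0, -2.0]
fitted_step = backup(costs, trans, gamma, apply_averager(beta, k, v))
new_costs, new_trans = embed(costs, trans, gamma, beta, k)
embedded_step = backup(new_costs, new_trans, gamma, v)
print(fitted_step, embedded_step)
```

The column of `new_trans` for the unsampled state 1 is zero for every action, which is the sense in which the new process lives entirely on the sample.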

In general, the backup operator for the embedded MDP may not be a
contraction in any norm. Figure 2.4 shows an example where this backup operator
diverges, since the embedded MDP has a state with infinite cost. However, we
can often guarantee that the embedded MDP is well-behaved. For example,
if M is discounted, or if A uses weight decay (i.e., if Σ_j β_ij < 1 for all i),
then T_M ∘ M_A will be a max norm contraction. Other conditions for the good
behavior of the embedded MDP are discussed in [Gor95a].
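The divergence in Figure 2.4 can be reproduced in a few lines. The arcs of the process are not recoverable from the caption alone, so the sketch assumes the chain 3 → 2 → 1 with unit arc costs and a cost-free goal at state 1; the averager is the one described in the caption (v(2) is replaced by v(3)). All function names are illustrative.

```python
def averager(v):
    """Figure 2.4(b): states 1 and 3 keep their values; v(2) is replaced by v(3)."""
    v1, v2, v3 = v
    return (v1, v3, v3)

def backup(v):
    """Assumed chain 3 -> 2 -> 1, unit arc costs, no cost at the goal state 1."""
    v1, v2, v3 = v
    return (0.0 + v1, 1.0 + v1, 1.0 + v2)

v = (0.0, 0.0, 0.0)
history = []
for _ in range(10):
    v = backup(averager(v))   # one step of fitted value iteration with gamma = 1
    history.append(v)

for step, values in enumerate(history, start=1):
    print(step, values)
```

Under these assumptions the estimate for state 3 grows by one on every iteration, because in the embedded process state 3 only ever leads back to itself.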


2.3  Converging  to  what? 

Until now, we have only considered the convergence or divergence of fitted
dynamic programming algorithms. Of course we would like not only convergence,
but convergence to a reasonable approximation of the value function.

Suppose that M is an MDP with value function v*, and let A be an averager.
What if v* is also a fixed point of M_A? Then v* is a fixed point of T_M ∘ M_A; so
if we can show that T_M ∘ M_A converges to a unique answer, we will know that
it converges to the right answer. For example, if M is discounted, or if it has
E(c(x,u)) > 0 for all x ≠ ©, then T_M ∘ M_A will converge to v*.

If we are trying to solve a nondiscounted MDP and v* differs slightly from
the nearest fixed point of M_A, arbitrarily large errors are possible. If we are
trying to solve a discounted MDP, on the other hand, we can prove a much
stronger result: if we only know that the optimal value function is near a fixed
point of our averager, we can guarantee an error bound for our learned value
function. (A bound immediately follows (see e.g. [SY94]) for the loss incurred by
following the corresponding greedy policy.) For a proof of the following theorem
see Section 2.7.3.

Theorem 2.3  Let v* be the optimal value function for a finite Markov decision
process M with discount factor γ. Let T_M be the parallel value backup operator
for M. Let M_A be a nonexpansion in max norm. Let v^A be any fixed point of
M_A. Suppose ‖v^A − v*‖ = ε, where ‖·‖ denotes max norm. Then iteration of
T_M ∘ M_A converges to a value function v_0 so that

\[ \| v^* - v_0 \| \le \frac{2\gamma\epsilon}{1-\gamma} \]

\[ \| v^* - M_A(v_0) \| \le 2\epsilon + \frac{2\gamma\epsilon}{1-\gamma} \]
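One way to see where the constants in this bound come from is the short calculation below. It is a sketch that uses only the triangle inequality and the three fixed-point identities v* = T_M v*, v^A = M_A v^A, and v_0 = T_M(M_A v_0), writing v^A for the fixed point of M_A that lies within ε of v*; the proof in Section 2.7.3 may be organized differently.

```latex
\begin{align*}
\| v^* - M_A v_0 \|
  &\le \| v^* - v^A \| + \| M_A v^A - M_A v_0 \|
   \le \epsilon + \| v^A - v_0 \|
   \le 2\epsilon + \| v^* - v_0 \| , \\
\| v^* - v_0 \|
  &= \| T_M v^* - T_M (M_A v_0) \|
   \le \gamma \, \| v^* - M_A v_0 \|
   \le \gamma \left( 2\epsilon + \| v^* - v_0 \| \right) .
\end{align*}
% Solving the second line for \| v^* - v_0 \| gives
\[
\| v^* - v_0 \| \le \frac{2\gamma\epsilon}{1-\gamma},
\qquad
\| v^* - M_A v_0 \| \le 2\epsilon + \frac{2\gamma\epsilon}{1-\gamma} .
\]
```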

Others have derived similar bounds for smaller classes of function
approximators. For example, for a bound on the error introduced by approximating a
continuous MDP with a grid, see [CT89].

The sort of error bound which we have proved is particularly useful for
approximators such as linear interpolation and grids which have many fixed points.
Because it depends on the maximum difference between v* and the fixed point
v^A of M_A, the bound is not very useful if v* may have large discontinuities at
unknown locations; if v* has a discontinuity of height d, then any averager which
can’t mimic the location of this discontinuity exactly will have no representable
functions (and therefore no fixed points) within d/2 of v*.

2.4  In practice

The most common problems with approximate value iteration are oversmoothing
and the introduction of barriers into the embedded MDP. By the introduction
of barriers, we mean that sometimes the embedded MDP can be divided
into two pieces so that the first piece contains the goal and the second piece
has no transitions into the first. In this case, the estimated values of the states
in the second piece will be infinite. (A special case of this situation is that,
if the averager ignores the goal state, then the embedded MDP will have no
transitions into the goal.) A less drastic but similar problem occurs when the
second piece has only low-probability transitions to the first; in this case, the
costs for states in the second piece will not be infinite, but will still be artificially
inflated.

This sort of problem is likely to happen when the MDP has short transitions
and when there are large regions where a single state dominates the averager.
For a particularly bad example, suppose our function approximator is
1-nearest-neighbor. If the transitions out of a sampled state x in the original MDP are
shorter than half the distance to the nearest adjacent sampled state, then the
only transitions out of x in the embedded MDP will lead straight back to x.
So, x will have infinite cost in the embedded MDP. Similarly, in local weighted
averaging with a narrow kernel, a short transition out of x in the original MDP
will translate to a high probability self loop in the embedded MDP, causing x
to have a finite but very large cost. In both of these examples, we can imagine
that the averager is producing a drag on transitions out of x, so that actions
in the embedded MDP don’t get the agent as far on average as they did in the
original MDP.
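The 1-nearest-neighbor barrier can be seen in a toy deterministic process. The sketch below assumes integer states on a line, an agent that moves left by a fixed step, and sampled states 0, 5, and 10; these numbers and the helper names are hypothetical.

```python
def nearest_sample(x, sample):
    """1-nearest-neighbor: the sampled state closest to x (ties go to the lower state)."""
    return min(sample, key=lambda s: (abs(s - x), s))

def embedded_next(x, step, sample):
    """Move left by `step` (stopping at 0), then scatter to the nearest sampled state."""
    landing = max(x - step, 0)
    return nearest_sample(landing, sample)

sample = [0, 5, 10]   # adjacent sampled states are 5 apart, so half the gap is 2.5

short_steps = {x: embedded_next(x, 1, sample) for x in sample}
long_steps = {x: embedded_next(x, 3, sample) for x in sample}
print(short_steps)   # step shorter than half the gap
print(long_steps)    # step longer than half the gap
```

With the short step every sampled state maps back to itself, so no sampled state other than the goal can ever reach the goal in the embedded process; with the longer step each transition reaches the next sampled state to the left.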

One  way  to  avoid  creating  barriers  in  the  embedded  MDP  is  to  make  sure 
that  no  single  state  has  the  dominant  weight  over  a  large  region.  The  best  way  to 
do  so  is  to  sample  the  state  space  more  densely;  but  if  we  could  afford  to  do  that, 
we  wouldn’t  need  a  function  approximator  in  the  first  place.  Another  way  is  to 
increase  a  smoothing  parameter  such  as  kernel  width  or  number  of  neighbors, 
and  so  reduce  the  weight  of  each  sample  point  in  its  immediate  neighborhood. 
Unfortunately,  increasing  the  amount  of  smoothing  risks  oversmoothing. 

Oversmoothing happens when a function approximator interacts with value
iteration to wash out the features of the value function that we are interested in.
In oversmoothing, the function approximator could learn a good approximation
to the value function if it were trained by supervised learning, but fitted value
iteration still converges to a bad approximation. For example, if the agent
must follow a long, narrow path to the goal, the scattering effect of a
wide-kernel averager is almost certain to push it off of the path before it reaches the
end.

Figure 2.6 demonstrates oversmoothing in a simple one-dimensional Markov
process. In this process, the state space is the interval [0, 1]. The agent moves
left a distance of 0.1 every time step, except when its position is already left
of 0.1, in which case it just moves to the origin. The state x = 0 is terminal.
The cost at state x is 0.1 cos(20πx), except that if x < 0.1 the cost is pro-rated
by the distance moved. Since the period of cos(20πx) is equal to the distance
moved on each step, the agent’s cost to move a given distance remains constant
throughout each trajectory and depends only on the trajectory’s starting state.

The four graphs in Figure 2.6 show the performance of fitted value iteration
with k-nearest-neighbor for k = 1, 5, 10, 15. The solid line in each graph shows
the true value function v(x) = x cos(20πx). The dashed line shows the
approximation to v(x) computed by fitted value iteration with k-nearest-neighbor. For
k = 1 this approximation is good, while for larger values of k it cuts off the
peaks of v(x). To demonstrate that this problem is not just due to the inherent
smoothing in k-nearest-neighbor, the dotted line in each graph shows the
approximation to v(x) computed by supervised learning. For larger values of k it
is clear that, while some of the smoothing comes from k-nearest-neighbor itself,
combining k-nearest-neighbor with fitted value iteration amplifies the problem.
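A rough re-creation of this experiment is sketched below. It is not the code behind Figure 2.6: it assumes an evenly spaced grid x_i = i/200 (201 points including the terminal state), treats any move that reaches the origin as terminating with the pro-rated cost, breaks nearest-neighbor ties toward the lower index, and reports only the max-norm error of the fitted function at the grid points. The supervised-learning comparison is omitted.

```python
import math

N = 200      # grid states x_i = i / N for i = 0..N (an assumed even sample)
STEP = 20    # moving left by 0.1 corresponds to exactly 20 grid cells

def true_value(i):
    x = i / N
    return x * math.cos(20 * math.pi * x)

def neighbors(i, k):
    """Indices of the k grid states nearest to state i (ties go to the lower index)."""
    return sorted(range(N + 1), key=lambda j: (abs(j - i), j))[:k]

def fitted_vi_error(k, sweeps=50):
    """Max-norm error of fitted value iteration with k-nearest-neighbor averaging."""
    nbrs = [neighbors(i, k) for i in range(N + 1)]

    def fitted(targets, i):
        return sum(targets[j] for j in nbrs[i]) / k

    targets = [0.0] * (N + 1)
    for _ in range(sweeps):
        new_targets = []
        for i in range(N + 1):
            x = i / N
            if i <= STEP:
                # The agent reaches the terminal origin; cost is pro-rated by x / 0.1.
                new_targets.append(x * math.cos(20 * math.pi * x))
            else:
                cost = 0.1 * math.cos(20 * math.pi * x)
                new_targets.append(cost + fitted(targets, i - STEP))
        targets = new_targets
    return max(abs(fitted(targets, i) - true_value(i)) for i in range(N + 1))

errors = {k: fitted_vi_error(k) for k in (1, 5, 10, 15)}
for k, err in errors.items():
    print(k, round(err, 4))
```

Because every move lands exactly on a grid point in this setup, the k = 1 run reproduces the true values up to rounding, while larger k averages away the oscillation at every backup.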

Figure 2.6: Oversmoothing in k-nearest-neighbor, for k = 1, 5, 10, 15 out of a
sample of 200 states. The solid line is the true value function, the dashed line
is its approximation with fitted value iteration and k-nearest-neighbor, and the
dotted line is its approximation with supervised learning and k-nearest-neighbor.

The reason for oversmoothing is that fitted value iteration applies the
function approximator M_A over and over again to the candidate value function.
Since M_A by definition loses some information, multiple applications of M_A may
lose so much information that the resulting approximation to the value function
is useless. This problem hits some function approximators harder than others:
while methods like state aggregation and linear interpolation don’t usually
suffer too badly, methods like k-nearest-neighbor with large k and local weighted
averaging with a wide kernel can have problems.

To see why fitted value iteration behaves differently with k-nearest-neighbor
than with linear interpolation, consider what happens if we are lucky enough
that the function approximator can represent the true value function exactly;
that is, suppose $v^* = M_A v$ for some $v$. (The situation will be qualitatively
similar if we can just represent something close to the true value function.) If
we’re using linear interpolation, the above assumption means that $v^*$ is a fixed
point of $M_A$, since reapplying linear interpolation to a linearly-interpolated
function doesn’t change anything. So, $v^*$ will be a fixed point of the fitted
value iteration update $T_M \circ M_A$, and we will end up with zero error. On the
other hand, reapplying k-nearest-neighbor does change the result (that is,
$M_A v \neq M_A M_A v$ in general), so fitted value iteration with
k-nearest-neighbor can drift away from $v^*$ and end up somewhere else.
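
The contrast between the two approximators can be checked numerically. The
sketch below is illustrative only and is not taken from the experiments: it
assumes nine evenly spaced states on [0, 1], three of which serve as sample
points, and an unweighted 2-nearest-neighbor average. The helper names
(`interp_matrix`, `knn_matrix`, `matmul`, `close`) are hypothetical. Each
approximator is written as a matrix, so reapplying it is a matrix product.

```python
def interp_matrix(states, knot_idx):
    """Matrix of the averager that linearly interpolates between sample states."""
    n = len(states)
    M = [[0.0] * n for _ in range(n)]
    for i, s in enumerate(states):
        # Find the pair of adjacent sample states that brackets s.
        for lo, hi in zip(knot_idx, knot_idx[1:]):
            if states[lo] <= s <= states[hi]:
                t = (s - states[lo]) / (states[hi] - states[lo])
                M[i][lo] = 1.0 - t
                M[i][hi] = t
                break
    return M

def knn_matrix(states, knot_idx, k):
    """Matrix of the averager that averages the k nearest sample states."""
    n = len(states)
    M = [[0.0] * n for _ in range(n)]
    for i, s in enumerate(states):
        nearest = sorted(knot_idx, key=lambda j: abs(states[j] - s))[:k]
        for j in nearest:
            M[i][j] = 1.0 / k
    return M

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def close(A, B, tol=1e-12):
    return all(abs(a - b) <= tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))

states = [i / 8 for i in range(9)]   # assumed state space
knot_idx = [0, 4, 8]                 # assumed sample states (indices into states)

M_lin = interp_matrix(states, knot_idx)
M_knn = knn_matrix(states, knot_idx, k=2)

# Linear interpolation is idempotent; 2-nearest-neighbor averaging is not.
lin_idempotent = close(matmul(M_lin, M_lin), M_lin)
knn_idempotent = close(matmul(M_knn, M_knn), M_knn)
print(lin_idempotent, knn_idempotent)
```

In this toy setting the interpolation matrix satisfies $M_A M_A = M_A$, while
the nearest-neighbor matrix does not, which is the property the argument above
relies on.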

Both of the above problems, too much smoothing and the introduction of
barriers, can be reduced if we can alter our MDP so that the actions move
the agent farther. For example, we might look ahead two or more time steps
at each value backup. (This strategy corresponds to the dynamic programming
operator $T_M^n \circ M_A$ for some $n > 1$. Since $T_M^n$ is the backup operator
for the MDP derived by composing n copies of M, the previous sections’
convergence theorems also apply to $T_M^n \circ M_A$.) While in general the cost
of looking ahead n steps is exponential in n, there are many circumstances where
we can reduce this cost dramatically. For instance, in a physical simulation, we
can choose a longer time increment; in a grid world, we can consider only the
compound actions which don’t contain two steps in opposite directions; and in
the case of a Markov process, where there is only 1 action, the cost of lookahead
is linear rather than exponential in n. (In the last case, TD($\lambda$) [Sut88]
allows us to combine lookaheads at several depths.) If actions are selected from
an interval of $\mathbb{R}$, numerical minimum-finding algorithms such as Newton’s
method or golden section search can find a local minimum quickly. In any case,
if the depth and branching factor are large enough, standard heuristic search
techniques can at least chip away at the base of the exponential.


2.5  Experiments 

This  section  describes  our  experiments  with  several  Markov  decision  problems: 
two  taken  from  [BM95],  and  one  which  shows  that  fitted  value  iteration  can 
learn  value  functions  in  extremely  high-dimensional  state  spaces. 


2.5.1  Puddle  world 

In  this  world,  the  state  space  is  the  unit  square,  and  the  goal  is  the  upper  right 
corner.  The  agent  has  four  actions,  which  move  it  up,  left,  right,  or  down  by  .1 
unit  per  step.  The  cost  of  each  action  depends  on  the  current  state:  for  most 
states,  it  is  the  distance  moved,  but  for  states  within  the  two  “puddles,”  the 
cost  is  higher.  See  figure  2.7. 

For a function approximator, we will use bilinear interpolation, defined as
follows: to find the predicted value at a point $(x, y)$, first find the corners
$(x_0, y_0)$, $(x_0, y_1)$, $(x_1, y_0)$, and $(x_1, y_1)$ of the grid square
containing $(x, y)$. Interpolate along the left edge of the square between
$(x_0, y_0)$ and $(x_0, y_1)$ to find the predicted value at $(x_0, y)$. Similarly,
interpolate along the right edge to find the predicted value at $(x_1, y)$. Now
interpolate across the square between $(x_0, y)$ and $(x_1, y)$ to find the
predicted value at $(x, y)$.
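
The three interpolation steps above can be sketched as follows. This is a
minimal illustration rather than the code used for the experiments: the
function name `bilinear`, the list-of-lists layout `values[i][j]`, and the
clamping of border points to the nearest grid square are all assumptions.

```python
import bisect

def bilinear(values, xs, ys, x, y):
    """Predict the value at (x, y); values[i][j] is stored at corner (xs[i], ys[j])."""
    # Index of the grid square containing (x, y), clamped to the grid.
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 2)
    j = min(max(bisect.bisect_right(ys, y) - 1, 0), len(ys) - 2)
    x0, x1, y0, y1 = xs[i], xs[i + 1], ys[j], ys[j + 1]
    ty = (y - y0) / (y1 - y0)
    tx = (x - x0) / (x1 - x0)
    # Interpolate along the left and right edges of the square ...
    left = (1 - ty) * values[i][j] + ty * values[i][j + 1]            # at (x0, y)
    right = (1 - ty) * values[i + 1][j] + ty * values[i + 1][j + 1]   # at (x1, y)
    # ... then across the square.
    return (1 - tx) * left + tx * right

# Corners of a 7 x 7 grid on the unit square: 8 x 8 = 64 sample points.
xs = ys = [i / 7 for i in range(8)]
# An affine test function; bilinear interpolation reproduces it up to rounding.
values = [[2 * x + 3 * y + 1 for y in ys] for x in xs]
print(bilinear(values, xs, ys, 0.33, 0.71))
```

Because the prediction is a convex combination of the four corner values, this
approximator is an averager in the sense used in this chapter.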

Figure  2.7  shows  the  cost  function  for  one  of  the  actions,  the  optimal  value 
function  computed  on  a  100  x  100  grid,  an  estimate  of  the  optimal  value  function 
computed  with  bilinear  interpolation  on  the  corners  of  a  7  x  7  grid  (i.e.,  on  64 
sample  points),  and  the  difference  between  the  two  estimates.  Since  the  optimal 
value  function  is  nearly  piecewise  linear  outside  the  puddles,  but  curved  inside, 
the  interpolation  performs  much  better  outside  the  puddles:  the  root  mean 
squared  difference  between  the  two  approximations  is  2.27  within  one  step  of 
the  puddles,  and  .057  elsewhere.  (The  lowest-resolution  grid  which  beats  bilinear 
interpolation’s  performance  away  from  the  puddles  is  20  x  20;  but  even  a  5  x  5 
grid  can  beat  its  performance  near  the  puddles.) 

Figure 2.7: The puddle world. From top left: the cost of moving up, the optimal
value function as seen by a 100 x 100 grid, the optimal value function as seen by
bilinear interpolation on the corners of a 7 x 7 grid, and the second value function
minus the first. Some plots are intentionally cut off at the top to preserve a
constant z scale and to show detail.

2.5.2  Hill-car


In this world, the agent must drive a car up to the top of a steep hill.
Unfortunately, the car’s motor is weak: it can’t climb the hill from a standing
start. So, the agent must back the car up and get a running start. The state
space is [-1, 1] x [-2, 2], which represents the position and velocity of the car;
there are two actions, forward and reverse. (This formulation differs slightly
from [BM95]: they allowed a third action, coast. We expect that the difference
makes the problem no more or less difficult.) The cost function measures time
until goal.

There  are  several  interesting  features  to  this  world.  First,  the  value  function 
contains  a  discontinuity  despite  the  continuous  cost  and  transition  functions: 
there  is  a  sharp  transition  between  states  where  the  agent  has  just  enough 
speed  to  get  up  the  hill  and  those  where  it  must  back  up  and  try  again.  Since 
most  function  approximators  have  trouble  representing  discontinuities,  it  will  be 
instructive  to  examine  the  performance  of  approximate  value  iteration  in  this 
situation.  Second,  there  is  a  long,  narrow  region  of  state  space  near  the  goal 
through  which  all  optimal  trajectories  must  pass  (it  is  the  region  where  the  car 
is  partway  up  the  hill  and  moving  quickly  forward).  So,  excessive  smoothing 
will  cause  errors  over  large  regions  of  the  state  space.  Finally,  the  physical 
simulation  uses  a  fairly  small  time  step,  .03  seconds,  so  we  need  fine  resolution 
in  our  function  approximator  just  to  make  sure  that  we  don’t  introduce  a  barrier. 

The results of our experiments appear in figure 2.8. For a reference model,
we fit a 128 x 128 grid. While this model has 16384 parameters, it is still less
than perfect: the right end of the discontinuity is somewhat rough. (Boyan and
Moore used a 200 by 200 grid to compute their optimal value function, and it shows
no perceptible roughness at this boundary.) We also fit two smaller grids, one
64 x 64 and one 32 x 32. Finally, we fit a weighted 4-nearest-neighbor model using
the 1024 centers of the cells of the 32 x 32 grid as sample points, and another
using a uniform random sample of 1000 points from the state space. Note
that the nearest-neighbor methods are roughly comparable in complexity to the
32 x 32 grid: each one requires us to evaluate about two thousand transitions
in the MDP for every value backup.

As the difference plots show, most of the error in the smaller models is
concentrated around the discontinuity in the value function. Near the
discontinuity, the grids perform better than the nearest-neighbor models (as we
would expect, since the nearest-neighbor models tend to smooth out
discontinuities). But away from the discontinuity, the nearest-neighbor models
win. The 32 x 32 nearest-neighbor model also beats the 32 x 32 grid at the right
end of the discontinuity: the car is moving slowly enough here that the grid
thinks that one of the actions keeps the car in exactly the same place. The
nearest-neighbor model, on the other hand, since it smooths more, doesn’t
introduce as much drag as the grid does and so doesn’t have this problem. The
root mean square error of the 64 x 64 grid (not shown) from the reference model
is 0.190s, and of the 32 x 32 grid is 0.336s. The RMS error of the
4-nearest-neighbor fitter with samples at the grid points is 0.205s. The
nearest-neighbor fitter with a random sample (not shown) performs slightly
worse, but still significantly better than the 32 x 32 grid (one-tailed t-test
gives p = .971): its error, averaged over 5 runs, is 0.235s.

Figure 2.8: Approximations to the value function for the hill-car problem. From
the top: the reference model, a 32 x 32 grid, a k-nearest-neighbor model, the
error of the 32 x 32 grid, and the error of the k-nearest-neighbor model. In
each plot, the x axis represents the agent’s position, the y axis represents its
velocity, and the z axis represents the estimated time remaining until reaching
the summit at x = .6.

Figure 2.9: Two smaller models for the hill-car world: a divergent 12 x 12 grid,
and a convergent nearest-neighbor model on the same 144 sample points.

All  of  the  above  models  are  fairly  large:  the  smallest  one  requires  us  to 
evaluate  2000  transitions  for  every  value  backup.  Figure  2.9  shows  what  happens 
when  we  try  to  fit  a  smaller  model.  The  12  x  12  grid  is  shown  after  60  iterations; 
it  is  in  the  process  of  diverging,  since  the  transitions  are  too  short  to  reach  the 
goal  from  adjacent  grid  cells.  The  4-nearest-neighbor  fitter  on  the  same  144  grid 
points  has  converged;  its  RMS  error  from  the  reference  model  is  0.278s  (better 
than  the  32  x  32  grid,  despite  needing  to  simulate  fewer  than  one-seventh  as 
many  transitions).  A  4-nearest-neighbor  fitter  on  a  random  sample  of  size  150 
(not  shown)  also  converged,  with  RMS  error  0.423s. 

2.5.3  Hill-car  the  hard  way 

In the previous section’s formulation of this world the state space is
[-1, 1] x [-2, 2], representing the position and velocity of the car. As we saw,
this state space is small enough that value iteration on a reasonably-sized grid
(1000 to 40000 cells, depending on the desired accuracy) can find the optimal
value function. To test fitted value iteration, we expanded the state space’s
dimensionality a thousandfold: instead of position and velocity, we represented
each state with two 32 x 32 grayscale pictures like the ones in figure 2.10(a),
making the new state space $[0, 1]^{2048}$. The top picture shows the car’s current
position; the bottom one shows where it would be in .03s if it took no action. A
simple grid on this expanded state space is unthinkable: even if we discretized
to just two gray levels per pixel, the grid would have $2^{2048}$ cells.

To  approximate  the  value  function,  we  took  a  random  sample  of  5000  legal 
pictures  and  ran  fitted  value  iteration  with  local  weighted  averaging.  In  local 
weighted  averaging,  the  fitted  value  at  state  x  is  an  average  of  the  target  values 
at  nearby  sampled  states  x' ,  weighted  by  a  Gaussian  kernel  centered  at  x.  We 
used  a  symmetric  kernel  with  height  1  at  the  center  and  height  \  when  the 
Euclidean distance from x' to x was about 22. (We arrived at this kernel width
by a coarse search: it is the narrowest kernel width we tested for which the
embedded MDP was usually connected.) We repeated the experiment three
times and selected the run with the median RMS error.

Figure 2.10: The hill-car world. Panels (a), (b), and (c).
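
Local weighted averaging as described above can be sketched in a few lines. The
sketch is an illustration under stated assumptions, not the experimental code:
the function name `local_weighted_average` is hypothetical, and the kernel is
parameterized as exp(-d^2 / bandwidth^2), which has height 1 at the center; the
bandwidth that matches the kernel width reported above is not derived here.

```python
import math

def local_weighted_average(query, samples, targets, bandwidth):
    """Gaussian-kernel-weighted average of the targets at the sampled states."""
    weights = []
    for s in samples:
        sq_dist = sum((a - b) ** 2 for a, b in zip(s, query))
        weights.append(math.exp(-sq_dist / bandwidth ** 2))  # height 1 at distance 0
    total = sum(weights)  # assumed nonzero: the query is near at least one sample
    return sum(w * t for w, t in zip(weights, targets)) / total

# Toy data (assumed): two nearby samples and one far away.
samples = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
targets = [0.0, 2.0, 100.0]
fitted = local_weighted_average((1.0, 0.0), samples, targets, bandwidth=1.0)
print(fitted)
```

The weights are nonnegative and sum to one after normalization, so the fitted
value always lies between the smallest and largest target, as required of an
averager. With a very narrow kernel every weight can underflow to zero, which
is one practical reason to check that the kernel is wide enough.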

The  resulting  value  function  is  shown  in  figure  2.10(b);  its  RMS  error  from 
the  exact  value  function  (figure  2.10(c))  is  0.155s.  By  comparison,  a  70  x  71 
grid  on  the  original,  two-dimensional  problem  has  RMSE  0.186s. 


2.6  Summary 

In this chapter we described an algorithm called fitted value iteration, which is
a generalization of state aggregation to handle any function approximator that
is a nonexpansion in max norm. Such approximators include k-nearest-neighbor
and linear and multilinear interpolation. We proved convergence rate and error
bounds for fitted value iteration applied to a discounted Markov decision process.

To analyze fitted value iteration applied to a nondiscounted MDP, we added
the additional constraints that the function approximator be linear and
monotone. The resulting class of approximators, called averagers, still contains
most of the popular nonexpansive approximators. We showed that running fitted
value iteration with an averager on an MDP M is equivalent to running exact
value iteration on a new, smaller MDP embedded within M.

Finally, we ran experiments which demonstrate that the combination of fitted
value iteration with an averager can solve problems that require both pattern
recognition and planning. These experiments show that fitted value iteration
significantly extends the range of problems that we can solve with a
provably-convergent algorithm.

2.7  Proofs 

This section contains proofs that were omitted from the main text. It can be
skipped without loss of continuity.

2.7.1  Can expansive approximators work?

The following theorem is almost a converse of Theorem 2.1. Instead of showing
that nonexpansion of $M_A$ is necessary to guarantee convergence for all MDPs
(which would be equivalent to showing that the existence of two points x and
y with $\|M_A x - M_A y\|_\infty > \|x - y\|_\infty$ is enough to find an MDP for
which fitted value iteration does not converge), it requires the additional
condition that $x \le y$.

Theorem 2.4  Suppose that the approximator A has mapping $M_A$, and suppose
that there are two value functions $x \le y$ such that

\[ \|M_A x - M_A y\|_\infty > \|x - y\|_\infty \]

Then there exists a Markov process M (with a finite value function) such that
fitted value iteration with approximator A does not converge to a unique answer
when applied to M.


Proof:  Write $M_A x = v$ and $M_A y = w$. We will construct a Markov process
M such that the backup operator $T_M$ has either $T_M v = x$ and $T_M w = y$ or
$T_M w = x$ and $T_M v = y$. In the former case the fitted value iteration operator
$T_M \circ M_A$ will have at least two fixed points, namely x and y, while in the
latter case $T_M \circ M_A$ will have a limit cycle that alternates between x and
y. In either case fitted value iteration will not converge to a unique answer.

Let s be any state where v and w differ by the maximum amount, that is,
with $|v(s) - w(s)| = \|v - w\|_\infty$. We will define the process M so that every
transition will end either in s or in the terminal state ©. First suppose that
$v(s) < w(s)$. Let i be an arbitrary state. By assumption
$0 \le y(i) - x(i) < w(s) - v(s)$. We define M’s transition function so that, if
i is the current state, the next state is s with probability

\[ p(i) = \frac{y(i) - x(i)}{w(s) - v(s)} \]

and © with probability $1 - p(i)$. We define M’s cost function so that the cost
of leaving state i is $x(i) - p(i) v(s)$. With these definitions, we can compute

\[
\begin{aligned}
(T_M v)(i) &= p(i) v(s) + x(i) - p(i) v(s) = x(i) \\
(T_M w)(i) &= p(i) (w(s) - v(s)) + x(i) = y(i)
\end{aligned}
\]

So, we have $T_M v = x$ and $T_M w = y$.

Now suppose that $w(s) < v(s)$. In this case
$0 \le y(i) - x(i) < v(s) - w(s)$, so we can define

\[ p(i) = \frac{y(i) - x(i)}{v(s) - w(s)} \]

and set the cost of leaving state i to $y(i) - p(i) v(s)$. Then

\[
\begin{aligned}
(T_M v)(i) &= p(i) v(s) + y(i) - p(i) v(s) = y(i) \\
(T_M w)(i) &= p(i) (w(s) - v(s)) + y(i) = x(i)
\end{aligned}
\]

So, $T_M v = y$ and $T_M w = x$.

In either case, $p(i) < 1$ for all i, so M reaches the terminal state © with
probability 1 from any initial state. Therefore M’s value function is finite as
required.  □
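
The construction in the proof can be checked on a small example. In the sketch
below the vectors x, y, v, and w are made-up inputs: v and w stand in for
$M_A x$ and $M_A y$ and are supplied directly, so no particular approximator is
modeled. The helper names `build_process` and `backup` are hypothetical.

```python
def build_process(x, y, v, w):
    """Return (s, p, cost) for the Markov process built in the proof."""
    n = len(x)
    s = max(range(n), key=lambda i: abs(v[i] - w[i]))  # where v and w differ most
    gap = abs(w[s] - v[s])
    p = [(y[i] - x[i]) / gap for i in range(n)]        # probability of moving to s
    base = x if v[s] < w[s] else y                     # which case of the proof applies
    cost = [base[i] - p[i] * v[s] for i in range(n)]
    return s, p, cost

def backup(values, s, p, cost):
    """Undiscounted backup: pay the cost, then reach s w.p. p(i), else terminate."""
    return [cost[i] + p[i] * values[s] for i in range(len(cost))]

# Made-up vectors with x <= y and max|v - w| = 2 > max|x - y| = 1.
x = [0.0, 0.0, 0.0]
y = [1.0, 0.5, 0.0]
v = [0.0, 0.0, 0.0]
w = [2.0, 1.0, 0.0]

s, p, cost = build_process(x, y, v, w)   # first case: v(s) < w(s)
first_v = backup(v, s, p, cost)          # equals x
first_w = backup(w, s, p, cost)          # equals y

s2, p2, cost2 = build_process(x, y, w, v)  # roles swapped: second case
second_v = backup(w, s2, p2, cost2)        # the "v" of this case maps to y
second_w = backup(v, s2, p2, cost2)        # the "w" of this case maps to x
```

In the first call the backup maps v to x and w to y (two fixed points of the
fitted update); in the second it maps them crosswise, giving the limit cycle.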


2.7.2  Nondiscounted case

This proof uses the definition of an averager from [Gor95a], which is slightly
less general than the one given here. The proof works with only minor changes
for the more general definition.

Proof of Theorem 2.2:  Define the embedded MDP M' as follows. It will
have the same state and action spaces as M, and it will also have the same
discount factor and initial distribution. We can assume without loss of generality
that state 1 of M is cost-free and absorbing: if not, we can renumber the states
of M starting at 2, add a new state 1 which satisfies this property, and make all
of its incoming transition probabilities zero. We can also assume, again without
loss of generality, that $\beta_1 = 1$ and $k_1 = 0$ (that is, that A always sets
$v(1) = 0$); again, if this property does not already hold for A, we can add a
new state 1.

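Suppose that, in M, action a in state x takes us to state y with probability
$p_{axy}$. Suppose that A replaces $v(y)$ by $\beta_y k_y + \sum_z \beta_{yz} v(z)$.
Then we will define the transition probabilities in M' for state x and action a
to be

\[
\begin{aligned}
p'_{axz} &= \sum_y p_{axy} \beta_{yz} \qquad (z \neq 1) \\
p'_{ax1} &= \sum_y p_{axy} (\beta_{y1} + \beta_y)
\end{aligned}
\]

These transition probabilities make sense: since A is an averager, we know that
$\sum_z \beta_{yz} + \beta_y$ is 1, so

\[
\begin{aligned}
\sum_z p'_{axz} &= \sum_{z \neq 1} \sum_y p_{axy} \beta_{yz}
                   + \sum_y p_{axy} (\beta_{y1} + \beta_y) \\
                &= \sum_y p_{axy} \Bigl( \sum_z \beta_{yz} + \beta_y \Bigr) \\
                &= \sum_y p_{axy} = 1
\end{aligned}
\]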

Now suppose that, in M, performing action a from state x yields expected
cost $c_{xa}$. Then performing action a from state x in M' yields expected cost

\[ c'_{xa} = c_{xa} + \gamma \sum_{y'} p_{axy'} \beta_{y'} k_{y'} \]

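Now the parallel value backup operator $T_{M'}$ for M' is

\[
\begin{aligned}
v(x) &\leftarrow \min_a E\bigl(c'(x,a) + \gamma v(\delta'(x,a))\bigr) \\
 &= \min_a \sum_z p'_{axz} \bigl(c'_{xa} + \gamma v(z)\bigr) \\
 &= \min_a \Bigl( \sum_y p_{axy} \sum_{z \neq 1} \beta_{yz} \bigl(c'_{xa} + \gamma v(z)\bigr)
      + \Bigl( \sum_y p_{axy} (\beta_{y1} + \beta_y) \Bigr) \bigl(c'_{xa} + \gamma v(1)\bigr) \Bigr) \\
 &= \min_a \sum_y p_{axy} \Bigl( \sum_{z \neq 1} \beta_{yz} \bigl(c'_{xa} + \gamma v(z)\bigr)
      + (\beta_{y1} + \beta_y) c'_{xa} \Bigr) \\
 &= \min_a \sum_y p_{axy} \Bigl( c'_{xa} + \gamma \sum_{z \neq 1} \beta_{yz} v(z) \Bigr) \\
 &= \min_a \Bigl( c'_{xa} + \gamma \sum_y p_{axy} \sum_{z \neq 1} \beta_{yz} v(z) \Bigr) \\
 &= \min_a \Bigl( c_{xa} + \gamma \sum_{y'} p_{axy'} \beta_{y'} k_{y'}
      + \gamma \sum_y p_{axy} \sum_{z \neq 1} \beta_{yz} v(z) \Bigr)
\end{aligned}
\]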

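On the other hand, the parallel value backup operator for M is

\[
\begin{aligned}
v(x) &\leftarrow \min_a E\bigl(c(x,a) + \gamma v(\delta(x,a))\bigr) \\
 &= \min_a \sum_y p_{axy} \bigl(c_{xa} + \gamma v(y)\bigr)
\end{aligned}
\]

If we replace $v(y)$ by its approximation under A, the operator becomes
$T_M \circ M_A$:

\[
\begin{aligned}
v(x) &\leftarrow \min_a \sum_y p_{axy} \Bigl( c_{xa} + \gamma \bigl( \beta_y k_y + \sum_z \beta_{yz} v(z) \bigr) \Bigr) \\
 &= \min_a \Bigl( c_{xa} + \gamma \sum_y p_{axy} \beta_y k_y
      + \gamma \sum_y p_{axy} \sum_{z \neq 1} \beta_{yz} v(z) \Bigr)
\end{aligned}
\]

which is exactly the same as $T_{M'}$ above.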
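
The identity between $T_{M'}$ and $T_M \circ M_A$ can also be checked numerically
on a small random instance. The sketch below is only an illustration: the MDP,
the averager coefficients, the discount 0.9, and all names (`P`, `C`, `beta`,
`beta0`, `k`, `backup`, `apply_averager`) are assumptions, and state 0 in the
code plays the role of state 1 in the text.

```python
import random

random.seed(0)
N, ACTIONS, GAMMA = 4, 2, 0.9   # assumed sizes and discount

def normalized(row):
    total = sum(row)
    return [r / total for r in row]

# MDP M: P[a][x][y] are transition probabilities, C[x][a] expected costs.
P = [[normalized([random.random() for _ in range(N)]) for _ in range(N)]
     for _ in range(ACTIONS)]
C = [[random.random() for _ in range(ACTIONS)] for _ in range(N)]
for a in range(ACTIONS):            # state 0 is cost-free and absorbing
    P[a][0] = [1.0] + [0.0] * (N - 1)
    C[0][a] = 0.0

# Averager A: (M_A v)(y) = beta0[y] * k[y] + sum_z beta[y][z] * v[z].
rows = [normalized([random.random() for _ in range(N + 1)]) for _ in range(N)]
beta0 = [r[0] for r in rows]
beta = [r[1:] for r in rows]
k = [random.random() for _ in range(N)]
beta0[0], beta[0], k[0] = 1.0, [0.0] * N, 0.0   # A always sets v(0) = 0

def apply_averager(v):
    return [beta0[y] * k[y] + sum(beta[y][z] * v[z] for z in range(N))
            for y in range(N)]

def backup(P, C, v):
    """Parallel value backup: min over actions of cost plus discounted value."""
    return [min(C[x][a] + GAMMA * sum(P[a][x][y] * v[y] for y in range(N))
                for a in range(ACTIONS)) for x in range(N)]

# Embedded MDP M': transition probabilities and costs as defined in the proof.
P_emb = [[[sum(P[a][x][y] * beta[y][z] for y in range(N)) for z in range(N)]
          for x in range(N)] for a in range(ACTIONS)]
for a in range(ACTIONS):
    for x in range(N):
        P_emb[a][x][0] += sum(P[a][x][y] * beta0[y] for y in range(N))
C_emb = [[C[x][a] + GAMMA * sum(P[a][x][y] * beta0[y] * k[y] for y in range(N))
          for a in range(ACTIONS)] for x in range(N)]

v = [0.0] + [random.random() for _ in range(N - 1)]   # any v with v(0) = 0
lhs = backup(P_emb, C_emb, v)             # T_{M'} v
rhs = backup(P, C, apply_averager(v))     # T_M (M_A v)
```

On this instance `lhs` and `rhs` agree to rounding error, and each row of
`P_emb` sums to one, matching the two claims verified in the proof.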


2.7.3  Error  bounds 

Proof of Theorem 2.3:  By Theorem 2.1, $T_M \circ M_A$ is a contraction in max
norm with factor $\gamma$, and therefore converges to some $v_0$. Repeated
application of the triangle inequality and the definition of a contraction give

\[
\begin{aligned}
\|v_0 - T_M(M_A(v^*))\| &= \|T_M(M_A(v_0)) - T_M(M_A(v^*))\| \\
 &\le \gamma \|v_0 - v^*\| \\
\|T_M(M_A(v^*)) - v^*\| &= \|T_M(M_A(v^*)) - T_M(v^*)\| \\
 &\le \gamma \|M_A(v^*) - v^*\| \\
 &\le \gamma \|M_A(v^*) - v^A\| + \gamma \|v^A - v^*\| \\
 &= \gamma \|M_A(v^*) - M_A(v^A)\| + \gamma \|v^A - v^*\| \\
 &\le \gamma \|v^* - v^A\| + \gamma \|v^A - v^*\| \\
\|v_0 - v^*\| &\le \|v_0 - T_M(M_A(v^*))\| + \|T_M(M_A(v^*)) - v^*\| \\
 &\le \gamma \|v_0 - v^*\| + 2\gamma \|v^* - v^A\| \\
(1 - \gamma) \|v_0 - v^*\| &\le 2\gamma \|v^* - v^A\| \\
\|v_0 - v^*\| &\le \frac{2\gamma \|v^* - v^A\|}{1 - \gamma}
\end{aligned}
\]

which is what was required.  □

If we let $\gamma \to 0$, we can make the above error bound arbitrarily small.
This result is somewhat counterintuitive, since A may not even be able to
represent $v^*$ exactly. The reason for this behavior is that the final step in
computing $v_0$ is to apply $T_M$; when $\gamma = 0$, this step produces $v^*$
immediately.

Approximate value iteration returns $M_A(v_0)$ rather than $v_0$ itself. So, an
error bound for $M_A(v_0)$ would be useful. The error bound on $v_0$ leads directly
to a bound for $M_A(v_0)$:

\[
\begin{aligned}
\|v^* - M_A(v_0)\| &\le \|v^* - v^A\| + \|v^A - M_A(v_0)\| \\
 &\le \epsilon + \|M_A(v^A) - M_A(v_0)\| \\
 &\le \epsilon + \|v^A - v_0\| \\
 &\le \epsilon + \|v^A - v^*\| + \|v^* - v_0\|
\end{aligned}
\]

On the other hand, usually we use $M_A(v_0)$ by doing a one-step lookahead to
find the greedy action; since this lookahead is equivalent to applying $T_M$
again, the error bound on $v_0$ may be a better indicator of performance.


2.7.4  The  embedded  process  for  Q-learning 

Here is the analog for Q-learning of the embedded MDP theorem. (For a
definition of the Q-learning algorithm, see [Wat89].) The chief difference is
that, where the theorem for value iteration considered the combined operator
$T_M \circ M_A$, this version considers $M_A \circ T_M^Q$, where $T_M^Q$ is the
parallel Q-learning backup operator. The difference is necessary to keep the min
operation in the Q-learning backup from getting in the way. Of course, if we
show that either $T_M^Q \circ M_A$ or $M_A \circ T_M^Q$ converges from any initial
guess, then the other must also converge.

This proof uses the definition of an averager from [Gor95a], which is slightly
less general than the one given here. The proof works with only minor changes
for the more general definition.

Theorem 2.5 (Embedded MDP for Q-learning)  For any averager A with
mapping $M_A$, and for any MDP M (either discounted or nondiscounted) with
Q-learning backup operator $T_M^Q$, the function $M_A \circ T_M^Q$ is the
Q-learning backup operator for a new Markov decision process M'.

Proof:  The domain of A will now be pairs of states and actions. Write
$\beta_{xayb}$ for the coefficient of $Q(y, b)$ in the approximation of $Q(x, a)$;
write $k_{xa}$ and $\beta_{xa}$ for the constant and its coefficient.

We now interpret the first parenthesis above as the cost of taking action c
from state z in M'; the second parenthesis is the transition probability for
M'. Note that the sum $\sum_y p'_{zcy}$ will generally be less than 1; so we will
make up the difference by adding a transition in M' from state z with action c
to state 1 (which is assumed as before to be cost-free and absorbing and to have
$v(1) = 0$).  □




Chapter  3 

CONVEX  ANALYSIS 
AND  INFERENCE 

This chapter presents a unified framework for reasoning about worst-case
regret bounds for learning algorithms. This framework is based on the theory
of duality of convex functions. It brings together results from computational
learning theory and Bayesian statistics, allowing us to derive new proofs of
known theorems, new theorems about known algorithms, and new algorithms.

This  chapter  does  not  mention  Markov  decision  processes  explicitly.  Instead, 
its  results  are  at  a  more  basic  level:  they  treat  the  problem  of  learning  without 
independent,  identically  distributed  data.  This  problem  is  one  of  the  main 
reasons  that  learning  value  functions  for  MDPs  is  difficult,  but  there  are  other 
reasons  as  well,  so  in  order  to  use  this  chapter’s  results  in  a  value  function 
learning  algorithm  we  would  have  to  solve  some  additional  problems. 

Probably  the  most  difficult  of  these  problems  is  to  decide  how  to  score  a 
hypothesized  value  function.  An  ideal  scoring  method  should  take  as  input 
some  transitions  and  a  value  function,  then  decide  how  well  the  value  function 
explains  the  transitions.  It  should  take  into  account  how  likely  the  transitions 
are  to  have  produced  the  observed  Bellman  residuals,  and  also  how  well  the 
value  function  predicts  which  transitions  are  optimal  choices  from  their  starting 
states.  Also,  in  order  to  take  advantage  of  the  results  of  this  chapter,  the  score 
should  be  convex  in  its  value-function  input.  This  last  requirement  rules  out 
such  scoring  methods  as  squared  Bellman  error.  Chapter  4  discusses  in  more 
detail  the  problem  of  scoring  value  functions. 

Some  of  the  material  from  this  chapter  will  appear  in  [Gor99]. 

3.1  The  inference  problem 

We are interested in the following problem: on each time step $t = 1 \ldots T$ we
must choose a prediction vector $w_t$ from a set of allowable predictions
$\mathcal{W}$. Then the loss function $l_t(w)$ is revealed to us, and we are
penalized $l_t(w_t)$. These penalties are additive, so our overall goal is to
minimize $\sum_{t=1}^{T} l_t(w_t)$. Our choice of $w_t$ may depend on
$l_1 \ldots l_{t-1}$ and possibly on some additional prior information, but it may
not depend on $l_t \ldots l_T$.

Many well-known inference problems, such as linear regression and estimation
of mixture coefficients, are special cases of this one. To express one of these
specific problems as an instance of our general inference problem, we will usually
interpret the loss function $l_t$ as encoding both a training example and a
criterion to be minimized: the location of the set of minima of $l_t$ encodes the
training example, while the shape of $l_t$ encodes the cost of deviations in each
direction. This double role for $l_t$ means that the loss function will usually
change from step to step, even if we are always trying to minimize the same kind
of errors. For example, if we wanted to estimate the mean of a population of
numbers from a sample $z_1, z_2, \ldots$, then $l_t(w)$ might be $(w - z_t)^2$. This
choice of $l_t$ encodes both the current training point $z_t$ and the fact that we
are minimizing squared error. (See Figure 3.1 for more detail.) Or, if we were
interested in a linear regression of $y_t$ on $x_t$, $l_t(w)$ might be
$(y_t - w \cdot x_t)^2$. This choice encodes both the current example $(x_t, y_t)$
and the fact that we want to minimize the squared prediction error. Or, if we
were trying to solve a mixture estimation problem, $l_t(w)$ might be
$-\ln(w \cdot p_t)$, where $w$ is the vector of mixture proportions and $p_{t,i}$
is the probability of the current training point under the ith model. (Here and
below, the notation $p_{t,i}$ stands for the ith component of the vector $p_t$.)
This choice of loss function encodes properties of the current example as well
as the fact that we want to maximize log-likelihood.

Figure 3.1: An example of the MAP algorithm in action, trying to minimize
sum of squared errors. The prediction at trial t is the mean of all examples up
to trial t-1, while the comparison vector is the mean of all examples up to trial
t.
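
The three example losses described above can be written as small closures, one
per training example. This is an illustrative sketch only; the function names
`mean_loss`, `regression_loss`, and `mixture_loss` are hypothetical, and vectors
are represented as plain Python lists.

```python
import math

def mean_loss(z_t):
    """Squared error for estimating a mean from the example z_t."""
    return lambda w: (w - z_t) ** 2

def regression_loss(x_t, y_t):
    """Squared prediction error for linear regression on the example (x_t, y_t)."""
    return lambda w: (y_t - sum(wi * xi for wi, xi in zip(w, x_t))) ** 2

def mixture_loss(p_t):
    """Negative log-likelihood of the current point under mixture proportions w."""
    return lambda w: -math.log(sum(wi * pi for wi, pi in zip(w, p_t)))

# Each call fixes one training example and returns the loss l_t as a function of w.
l_mean = mean_loss(3.0)
l_reg = regression_loss([1.0, 2.0], 5.0)
l_mix = mixture_loss([0.5, 0.5])
print(l_mean(5.0), l_reg([1.0, 2.0]), l_mix([0.3, 0.7]))
```

In each case the training example fixes where the loss is smallest, and the
functional form fixes how deviations are penalized, which is the double role
of $l_t$ noted above.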

We want to develop an algorithm for choosing a sequence of $w_t$s so as to
minimize our total loss $\sum_{t=1}^{T} l_t(w_t)$ even if the sequence of loss
functions $l_t$ is chosen by an adversary. Unfortunately this problem is
impossible without further assumptions: for example, the adversary could choose
loss functions with corners or discontinuities and make the losses of two
predictions $v_t$ and $w_t$ arbitrarily different even if $v_t$ and $w_t$ were
close together. So, we will make two basic simplifications. The first is that we
will place restrictions on the form of the functions $l_t$ that the adversary may
choose. The chief restrictions will be that $l_t$ is convex and that a measure of
the amount of information contained in $l_t$ does not increase too quickly from
trial to trial.

The second simplification is that we will seek a relative loss bound rather
than an absolute one. That is, we will define a comparison class $\mathcal{U}$ of
predictions, and we will seek to minimize our regret
$\sum_{t=1}^{T} (l_t(w_t) - l_t(u))$ versus the best predictor
$u \in \mathcal{U}$. (Often we will take $\mathcal{U} = \mathcal{W}$, so that we
are comparing our predictions to the best constant prediction. Sometimes,
though, we will need to take $\mathcal{U} \subset \mathcal{W}$ in order to prove a
bound.) Since $u$ can be chosen post hoc, with knowledge of the loss functions
$l_t$, such a regret bound is a strong statement.

The focus on regret instead of just loss is the chief place where our results
differ from traditional statistical estimation theory. It is what allows us to
handle sequences of loss functions that are too difficult to predict: our theorems
will still hold, but since there will be no comparison $u$ that has small loss, the
theorems will not tell us much about our total loss.

Surprisingly, with only weak restrictions on $l_t$ and w, we will be able to
prove bounds that are similar to the best possible average-case bounds (that is,
bounds where $l_t$ is chosen by some fixed probability law). Our theorems will
unify results from classical statistics (inference in exponential families and
generalized linear models) with those from computational learning theory
(weighted majority, aggregating algorithm, exponentiated gradient).

This regret bound framework has been studied before in [LW92, KW97,
KW96, Vov90, CBFH+95] among others. Also, some of our results are similar to
results from classical statistics such as the Cramér-Rao variance bound [SO91].
Our theorems are more general than each of these previous results in at least one
of the following ways. First, they apply to more general classes of convex loss
functions, including non-differentiable ones. Second, they apply to both online
(i.e., bounded computation per example) and offline (unbounded computation)
algorithms. Third, they apply to all sequences of loss functions, not just on
average. Finally, they apply at all time steps, not just asymptotically. Our
theorems are also less general than traditional statistical results in some ways.
For example, while the Cramér-Rao bound requires twice-differentiability of the
loss functions, it does not require global convexity, just local convexity.

All of our theorems will concern variations on the following simple and
intuitively appealing algorithm, which takes as input the loss functions
$l_1, \ldots, l_{t-1}$ observed on previous trials plus one additional loss
function $l_0$ which encodes our prior knowledge before the first trial.

MAP Algorithm: Predict any $w_t \in \arg\min_w \sum_{i=0}^{t-1} l_i(w)$.

The notation $\arg\min_w f(w)$ means the set of $w$s that minimize $f$. We assume
that the minimum is always achieved so that a legal prediction always exists.
Conditions which ensure the existence are described below. The algorithm is
called “MAP” or “maximum a posteriori” because of its Bayesian roots: if we
want to apply the MAP algorithm to the problem of estimating some population
parameters $w$ from an independent identically distributed sample
$z_1, z_2, \ldots$ then a good choice of loss function is the negative of the
log likelihood $l_t(w) = -\ln p(z_t \mid w)$. With this setting for $l_t$ the
MAP algorithm always chooses the prediction with maximal posterior probability
given the available information. Of course, we can still use the MAP algorithm
when we do not have i.i.d. samples; in this case $l_t$ will be unrelated to any
likelihood, and so “maximum a posteriori” may be a misnomer.

As  the  MAP  algorithm  is  stated  above  it  is  not  operational,  since  we  may 
not  know  how  to  perform  the  required  minimization.  A  striking  feature  of  the 
MAP  algorithm  is  that,  despite  the  complicated  machinery  required  to  prove 
its  theoretical  properties,  it  often  has  a  simple  and  efficient  implementation. 
In  fact,  as  we  will  see  below,  many  well-known  inference  algorithms  are  MAP 
algorithms. 

Figure 3.2: Definitions for convex functions.

One example of a specific implementation of the MAP algorithm is shown
in Figure 3.1. In this example, the learner is trying to minimize the sum of
squared distances between its predictions $w_t$ and a sequence of training
examples $z_1, \ldots, z_4$. For this problem the MAP algorithm will always
predict $w_t$ equal to the mean of all examples from trials $0, \ldots, t-1$.
(By convention we set $z_0 = 0$.) As shown in the figure, the best possible
constant prediction is $w = 4$, since that is the mean of $z_0, \ldots, z_4$.
The total loss of $w = 4$ is 34, so the regret of the MAP algorithm is the
difference between the loss $\sum_t l_t(w_t)$ and 34.
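
The running-mean behaviour described above can be sketched in a few lines of
Python. This is only an illustration: the data values are hypothetical (they
are not the values plotted in Figure 3.1), the function names are invented
here, and the comparison loss is summed over trials 1..T only, which is an
assumption about the bookkeeping rather than something taken from the figure.

```python
def map_predictions(zs):
    """Return w_1..w_T for losses l_i(w) = (w - z_i)^2, where zs = [z_0, ..., z_T].

    The minimizer of sum_{i<t} (w - z_i)^2 is the mean of z_0..z_{t-1}.
    """
    preds, running_sum = [], 0.0
    for t in range(1, len(zs)):
        running_sum += zs[t - 1]
        preds.append(running_sum / t)
    return preds


def regret(zs, u):
    """Total MAP loss on trials 1..T minus the loss of the constant prediction u."""
    preds = map_predictions(zs)
    map_loss = sum((w - z) ** 2 for w, z in zip(preds, zs[1:]))
    comparison_loss = sum((u - z) ** 2 for z in zs[1:])
    return map_loss - comparison_loss


zs = [0.0, 2.0, 4.0, 6.0]        # hypothetical data; z_0 = 0 by convention
print(map_predictions(zs))        # [0.0, 1.0, 2.0]
print(regret(zs, u=4.0))          # 21.0
```

Because each prediction is just an updated mean, the per-trial cost is constant
even though the definition of the algorithm asks for a full minimization.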

The  rest  of  the  paper  is  organized  as  follows.  In  Section  3.2  we  will  review 
some  basic  facts  about  convex  analysis  that  we  will  need  later  on.  In  Section  3.3 
we  will  outline  our  main  results  and  the  strategy  that  we  will  use  to  prove 
them.  In  Sections  3.4  and  3.5  we  will  prove  loss  bounds  for  the  Weighted 
Majority  algorithm,  as  an  example  of  how  to  apply  the  results  from  Section  3.3. 
Section  3.6  introduces  the  generalized  gradient  descent  algorithm,  which  is  a 
special  case  of  the  MAP  algorithm.  Section  3.7  proves  regret  bounds  for  a 
general  class  of  MAP  algorithms  that  includes  generalized  gradient  descent. 
Section  3.8  gives  some  examples  of  generalized  gradient  descent,  including  one 
which  is  a  version  of  the  Exponentiated  Gradient  algorithm.  Section  3.9  treats 
inference  in  exponential  families.  Section  3.10  introduces  generalized  linear 
regression  problems  and  proves  regret  bounds  for  them.  Finally,  Section  3.11 
gives  some  examples  of  generalized  linear  regression  algorithms,  and  Section  3.12 
concludes. 


3.2  Convex  duality 

For the proofs below we will need some definitions and basic results about
convex functions. A convex function is any function $f$ from a vector space $X$
to $\mathbb{R} \cup \{+\infty, -\infty\}$ which satisfies

$$\lambda f(x) + (1 - \lambda) f(y) \ge f(\lambda x + (1 - \lambda) y)$$

for all $x, y \in X$ and $\lambda \in [0, 1]$. A strictly convex function is one
for which we can replace $\ge$ by $>$ in the above inequality whenever
$x \ne y$ and $\lambda \in (0, 1)$. A proper convex function is one which is
always greater than $-\infty$ and not uniformly $+\infty$. The domain of $f$,
dom $f$, is the set of points where $f$ is finite. Convex functions are
continuous on int dom $f$, and differentiable on int dom $f$ except for a set of
measure zero. (The notation int $C$ refers to the interior of a set $C$, that
is, the points of $C$ which can be surrounded by an open set contained within
$C$.)

Some special cases of convex functions are the linear functions,
$f(x) = a \cdot x + b$ for a vector $a$ and scalar $b$, and the indicator
functions

$$\delta(x \mid C) = \begin{cases} 0 & x \in C \\ \infty & x \notin C \end{cases}$$

for  a  convex  set  C.  (We  will  sometimes  write  a  predicate  instead  of  a  set  C,  as 
in  (J(xl  =  !)•  There  should  be  no  danger  of  confusion.) 

A convex function $f$ is closed if its epigraph $\{(x, y) \mid y \ge f(x)\}$ is
closed. The closure of $f$, cl $f$, is the function whose epigraph is the
closure of $f$'s epigraph. For proper convex functions, closedness is the same
as lower semicontinuity.

The convex hull of a function $f$, conv $f$, is the pointwise supremum of all of
the convex functions which are everywhere less than $f$. In other words,
conv $f$ is the function whose epigraph is the convex hull of $f$'s epigraph.
The convex hull always exists and is convex, although it may be the constant
function $-\infty$.

The subgradient of a convex function at some point, written $\partial f(x)$, is
the set of vectors $a$ such that $f(y) \ge f(x) + (y - x) \cdot a$ for all $y$.
In other words, the subgradient of $f$ at $x$ is the set of slopes of all
tangent planes to $f$ at $x$. We will write dom $\partial f$ for the set of $x$
such that $\partial f(x)$ is nonempty. We have

$$\operatorname{int} \operatorname{dom} f \subseteq \operatorname{dom} \partial f \subseteq \operatorname{dom} f$$

The subgradient of a smooth convex function $f$ is single-valued on int dom $f$,
and $\partial f(x) = \{f'(x)\}$ where $f'(x)$ stands for the usual derivative.
By a slight abuse of notation we will write $f'$ even when the subgradient is
not single-valued; in this case $f'$ will mean any (fixed) function such that
$f'(x) \in \partial f(x)$. The rules for working with subgradients are similar
to the rules for working with derivatives; in particular,
$\partial(\lambda f)(x) = \lambda \partial f(x)$ and
$\partial(f + g)(x) \supseteq \partial f(x) + \partial g(x)$. We may replace
containment by equality in the latter formula under mild conditions, for example
if relint dom $f$ and relint dom $g$ have a point in common.

For every function $f$ we can define a new function $f^*$, called the dual of
$f$, by the formula

$$f^*(a) = \sup_x \, [\, a \cdot x - f(x) \,]$$

The notation sup denotes the supremum or least upper bound of an expression.
The dual tells us how the optimal value of a maximization problem changes if
we add a linear function to the objective. The dual is always closed and convex,
and $f^{**} = \operatorname{cl} \operatorname{conv} f$. If $f \ge g$ pointwise
then $f^* \le g^*$.

For example, the dual of $\exp(x)$ is $x \ln x - x$. The dual of $-\ln x$ is
$-1 - \ln(-x)$. The quadratic function $x^2/2$ is self-dual. The dual of $|x|$
is $\delta(x \mid [-1, 1])$.
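
The closed forms above can be spot-checked numerically. The sketch below is an
illustration only: it approximates the supremum in the definition of the dual
by a maximum over a finite grid, so the grid range and step (and the helper
name `conjugate`) are assumptions made here, and the results are accurate only
to roughly the grid resolution.

```python
import math


def conjugate(f, a, xs):
    """Approximate f*(a) = sup_x [a*x - f(x)] by a maximum over the grid xs."""
    return max(a * x - f(x) for x in xs)


grid = [i / 1000.0 for i in range(-20000, 20001)]   # [-20, 20], step 0.001
pos_grid = [x for x in grid if x > 0]               # domain of -ln x

# dual of exp(x) at a = 2 should be close to a ln a - a
exp_dual = conjugate(math.exp, 2.0, grid)

# dual of -ln x at a = -0.5 should be close to -1 - ln(-a)
neglog_dual = conjugate(lambda x: -math.log(x), -0.5, pos_grid)

# x^2 / 2 is self-dual: the value at a = 1.5 should be 1.5^2 / 2
quad_dual = conjugate(lambda x: x * x / 2.0, 1.5, grid)

# dual of |x| is the indicator of [-1, 1]: zero inside the interval, and
# growing with the grid radius (here 0.3 * 20) outside it
abs_dual_in = conjugate(abs, 0.7, grid)
abs_dual_out = conjugate(abs, 1.3, grid)
```

The last value is finite only because the grid is bounded; widening the grid
makes it grow without limit, which is how the infinite value of the indicator
function shows up numerically.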

The dual of $k f(x)$ is $k f^*(x / k)$. The dual of a linear function
$a \cdot x + b$ is $\delta(x \mid \{a\}) - b$. The dual of $f + g$ is
$f^* \,\Box\, g^*$, where the infimal convolution $u \,\Box\, v$ is defined as

$$(u \,\Box\, v)(x) = \inf_y \, ( u(x - y) + v(y) )$$

Figure 3.3: Generalized Bregman divergences.

A special case is that, if $g = a \cdot x + b$, then
$(f + g)^*(x) = f^*(x - a) - b$. Another special case happens when we can
partition $X$ into two subspaces $X_f$ and $X_g$ so that $f(x)$ depends only on
the component of $x$ in $X_f$ and $g(x)$ depends only on the component of $x$ in
$X_g$. For example, if we write $f(x, y) = g(x) + h(y)$, then
$f^*(x, y) = g^*(x) + h^*(y)$, so the dual of $|x| + |y|$ is
$\delta(x \mid [-1, 1]) + \delta(y \mid [-1, 1])$.

The subgradients of $f$ and $f^*$ are (almost) inverses of each other. If $f$
is strictly convex, then $(f^*)'(f'(x)) = x$ for all $x$ where $f'$ is defined.
More generally, for any closed convex function $f$, $a \in \partial f(x)$ is
equivalent to $x \in \partial f^*(a)$.

Let $f$ be closed and convex. From the subgradient inequality, we know that

$$D_f(x \mid y) = f(x) - f(y) - (x - y) \cdot f'(y) \ge 0$$

whenever $f'(y)$ is defined. The function $D_f$ is called a Bregman divergence.
Some examples of Bregman divergences include squared Euclidean distance
(which is $D_{x \cdot x}$) and information divergence (which is $D_f$ for the
negative entropy $f(x) = \sum_i x_i \ln x_i$).

Bregman divergences can be either symmetric (like squared Euclidean distance)
or asymmetric (like information divergence). If $f$ is strictly convex, then
$D_f(x \mid y) = 0$ is equivalent to $x = y$. If $g$ is linear, then
$D_{f+g} = D_f$.
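
A short numerical sketch of the two examples may help. It evaluates the
Bregman divergence directly from its definition for $f(x) = x \cdot x$ and for
the negative entropy $f(x) = \sum_i x_i \ln x_i$; the latter generating
function, the helper names, and the test vectors are assumptions chosen here
for illustration.

```python
import math


def bregman(f, grad_f, x, y):
    """D_f(x | y) = f(x) - f(y) - (x - y) . grad_f(y), with vectors as lists."""
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, grad_f(y)))
    return f(x) - f(y) - inner


def sq_norm(x):
    return sum(xi * xi for xi in x)


def sq_norm_grad(x):
    return [2.0 * xi for xi in x]


def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)


def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]


x = [0.2, 0.3, 0.5]   # hypothetical points in the unit simplex
y = [0.4, 0.4, 0.2]

d_sq = bregman(sq_norm, sq_norm_grad, x, y)              # squared distance |x - y|^2
d_sq_rev = bregman(sq_norm, sq_norm_grad, y, x)          # same value: symmetric
d_kl = bregman(neg_entropy, neg_entropy_grad, x, y)      # sum_i x_i ln(x_i / y_i)
d_kl_rev = bregman(neg_entropy, neg_entropy_grad, y, x)  # differs: asymmetric
```

For vectors that both sum to one, the negative-entropy divergence reduces to
the familiar sum of $x_i \ln(x_i / y_i)$, because the linear correction terms
cancel.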

The Bregman distances given by $f$ and $f^*$ are strongly related: if $f$ is
strictly convex, then

$$D_f(x \mid y) = D_{f^*}(f'(y) \mid f'(x))$$

If $f$ is not strictly convex, this equality may not hold: if $x$ is in the
middle of a flat spot of $f$, then $f'(x)$ does not uniquely specify $x$.

This difficulty is a symptom of the more general problem which is illustrated
in Figure 3.3: if a point $(y, f(y))$ is at a corner of $f$, then there are
infinitely many possible tangent planes to $f$ at $y$. So, there are infinitely
many possible Bregman divergences all represented by $D_f(x \mid y)$.

One solution is to pick a divergence arbitrarily and fix $D_f$ to mean just that
divergence. This solution is the one we have been using implicitly so far, since
we have defined $f'(y)$ to be an arbitrary but fixed element of
$\partial f(y)$. A better solution is to generalize the definition of Bregman
divergence.

We can motivate our generalization by noticing that, while a point $y$ does
not define a unique tangent plane to $f$, a slope $a$ does. There is always at
most one plane with slope $a$ tangent to $f$, and if it exists it is given by
the equation

$$f((f^*)'(a)) + (x - (f^*)'(a)) \cdot a$$

There is the same ambiguity in computing $(f^*)'$ that there was in computing
$f'$, but it doesn't matter: if $\partial f^*$ is multivalued, then each value
refers to a different point along a linear segment of $f$, and the tangent plane
at any of these points is the same.

So, we define the generalized Bregman divergence, which measures the
dissimilarity between a point $x$ and a slope $a$, to be

$$\mathbb{D}_f(x \mid a) = f(x) + f^*(a) - x \cdot a$$

This definition is a generalization of the original Bregman divergence since, if
$a = f'(y)$, then $D_f(x \mid y) = \mathbb{D}_f(x \mid a)$. All of the
properties of Bregman divergences given above carry over straightforwardly to
$\mathbb{D}_f$.

Generalized Bregman divergences satisfy a simple symmetry property: our
assumption that $f$ is closed implies that

$$\mathbb{D}_f(x \mid a) = \mathbb{D}_{f^*}(a \mid x)$$

Another advantage of the new definition is that $\mathbb{D}_f(x \mid a)$ is
defined for any $x$ and $a$ (although it may be infinite) and convex separately
in $x$ and in $a$ (although it may not be convex jointly in $x$ and $a$). By
contrast, $D_f(x \mid y)$ is undefined if $\partial f(y)$ is empty, and it may
not be convex in $y$.
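
As a small worked instance of these definitions, consider the self-dual
quadratic $f(x) = x^2/2$ in one dimension. The choice of $f$ is made here
purely for illustration; it is the simplest case in which both sides of the
symmetry property can be written out by hand.

```latex
\begin{align*}
\mathbb{D}_f(x \mid a)
  &= f(x) + f^*(a) - x a
   = \tfrac12 x^2 + \tfrac12 a^2 - x a
   = \tfrac12 (x - a)^2, \\
\mathbb{D}_{f^*}(a \mid x)
  &= f^*(a) + f^{**}(x) - a x
   = \tfrac12 a^2 + \tfrac12 x^2 - a x
   = \tfrac12 (a - x)^2 .
\end{align*}
```

Since $f'(y) = y$ for this $f$, setting $a = y$ recovers the ordinary Bregman
divergence $D_f(x \mid y) = \tfrac12 (x - y)^2$, as the text requires.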

A function is called positively homogeneous if $f(\lambda x) = \lambda f(x)$
for all $\lambda > 0$. A nonnegative, positively homogeneous, closed, convex
function is called a gauge. Gauges are a generalization of norms: a norm is a
gauge that is symmetric ($f(x) = f(-x)$) and strictly positive except at the
origin ($f(x) = 0 \Leftrightarrow x = 0$). The dual of a gauge is an indicator
function for a convex set containing the origin, and vice versa.

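Two gauges $g$ and $g^\circ$ are called polar to each other if

$$g^\circ(y) = \inf \{ \lambda \ge 0 \mid (\forall x)\; x \cdot y \le \lambda g(x) \}$$

For example, the $L_p$ norm on $\mathbb{R}^n$ is defined to be

$$\|x\|_p = \Bigl( \sum_i |x_i|^p \Bigr)^{1/p}$$

and $\|\cdot\|_p$ and $\|\cdot\|_q$ are polar to each other when
$\frac{1}{p} + \frac{1}{q} = 1$. Polar gauges satisfy a generalization of
Hölder's inequality:

$$x \cdot y \le g(x)\, g^\circ(y)$$

for all $x, y$, with equality iff $\lambda y \in \partial g(x)$ for some
$\lambda > 0$. Polarity between gauges is related to duality between convex
functions: if $f(x) = \frac{1}{2} g(x)^2$, then

$$f^*(y) = \tfrac{1}{2}\, g^\circ(y)^2$$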
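
The inequality for polar gauges can be checked numerically for the $L_p$ norms.
The sketch below is illustrative only; the vectors and the exponent pair
$p = 3$, $q = 3/2$ are hypothetical choices, and `p_norm` is a helper written
for this example.

```python
def p_norm(x, p):
    """L_p norm of a vector given as a list of floats."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


p, q = 3.0, 1.5                     # 1/p + 1/q = 1/3 + 2/3 = 1
x = [1.0, -2.0, 3.0]

# A generic y: the inequality x.y <= ||x||_p ||y||_q is strict.
y = [0.5, -4.0, 1.0]
generic_gap = p_norm(x, p) * p_norm(y, q) - dot(x, y)

# y_i = sign(x_i) |x_i|^(p-1) is proportional to the gradient of ||x||_p,
# which is the equality case of the inequality.
y_tight = [1.0, -4.0, 9.0]
tight_gap = p_norm(x, p) * p_norm(y_tight, q) - dot(x, y_tight)
```

In the tight case both the inner product and the product of norms equal 36 up
to rounding, so `tight_gap` is zero to floating-point accuracy while
`generic_gap` is strictly positive.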

For  more  background  on  convex  duality,  see  [Roc70]  or  [OR70]. 


3.3  Proof  strategy

Our  main  result  is  a  bound  on  the  total  regret  of  the  MAP  algorithm.  It  is  stated 
below  as  Theorem  3.1,  and  an  important  specialization  is  given  as  Theorem  3.2. 
There  are  three  basic  steps  in  its  proof  and  application. 

Our  proof  is  by  an  amortized  analysis  [CLR90].  So,  the  first  step  is  to 
define  a  potential  function  for  the  MAP  algorithm.  This  potential  function  will 
decrease  on  trials  where  the  algorithm  suffers  a  large  regret,  and  increase  on 
trials  where  it  suffers  a  small  or  negative  regret.  That  way,  our  analysis  will  be 
able  to  handle  trials  with  large  regret  by  averaging  them  out  against  other  trials 
with  smaller  regret.  This  kind  of  amortized  analysis  is  a  generalization  of  an 
idea  which  was  introduced  in  [LW92]  and  also  used  in  many  other  regret-bound 
proofs. 

The  second  step  is  to  sum  the  regret  over  all  trials.  In  order  to  perform  this 
step,  we  will  introduce  some  constants  that,  roughly  speaking,  summarize  the 
amount  of  information  available  to  the  algorithm  at  the  beginning  of  each  trial. 
These  constants  depend  on  the  type  of  loss  function  we  are  interested  in,  so  we 
will  leave  them  unspecified. 

The  third  and  final  step  is  to  calculate  the  values  of  the  constants  for  the 
specific  algorithms  we  wish  to  analyze.  We  will  leave  this  step  for  subsequent 
sections. 


3.3.1  Existence 

Before we prove any regret bounds, we will look at when the MAP algorithm
is well-defined, that is, when the minimum of $L_t = \sum_{i=0}^{t-1} l_i$ is
guaranteed to be attained. While it is difficult to derive necessary and
sufficient conditions for attainment of the minimum, there are some sufficient
conditions which are easy to check. Throughout this section (and the rest of the
paper) we will assume that each $l_t$ is closed and convex. Because it will
avoid extra notation, we will adopt the convention that any prediction is legal
if $L_t$ is the constant function $+\infty$.

The simplest sufficient condition to check is whether dom $\partial l_0^*$ is
all of $W$, since this condition does not depend on $l_t$ for $t \ge 1$. Often
this condition is the only one we can check. Examples of functions that satisfy
this condition are $l_0(w) = w^2$ and $l_0(w) = w \ln w$. An example of a
function that does not satisfy this condition is $l_0(w) = |w|$. Loosely
speaking, this condition captures functions such that the norm of $l_0'(w)$
keeps increasing without bound as $w$ approaches the border of dom $l_0$.

Another simple condition to check is whether $l_t$ attains its minimum for
each  t.  Examples  of  this  kind  of  function  include  lt{w)  =  (w  -  z)^  and  Zf(p)  = 
LixXrLxipW)-  Linear  functions  (such  as  the  loss  functions  used  in  generalized 
gradient  descent,  described  below)  do  not  usually  satisfy  this  condition. 

If $l_t$ is linear for $t \ge 1$, say $l_t(w) = w \cdot x_t$, then $L_t$ will
attain its minimum exactly when

$$-X_t = -\sum_{i=1}^{t-1} x_i \in \operatorname{dom} \partial l_0^*$$

since this condition is true iff there is some $w$ so that

$$0 \in \partial L_t(w) \iff -X_t \in \partial l_0(w) \iff w \in \partial l_0^*(-X_t)$$

We  can  combine  and  generalize  these  conditions  into  the  following  lemma: 

Lemma 3.1 Suppose that the functions $l_0, l_1, \ldots$ are convex and closed.
Let $m_1, m_2, \ldots$ be closed convex functions, each of which attains its
minimum, such that $l_t - m_t$ is convex. Suppose there is a point $\omega_t$
for each $t$ so that

$$-\sum_{i=1}^{t-1} (l_i - m_i)'(\omega_t) \in \operatorname{dom} \partial l_0^*$$

Then the MAP algorithm applied to the loss functions $l_0, l_1, \ldots$
produces a legal prediction at each trial $t$.

Proof: Fix a trial $t$ and write $x_i = (l_i - m_i)'(\omega_t)$ for
$1 \le i < t$. The function $M(w) = l_0(w) + w \cdot \sum_{i=1}^{t-1} x_i$
achieves its minimum, since it is closed and convex and since the condition
$0 \in \partial M(w)$ is equivalent to

$$w \in \partial l_0^* \Bigl( -\sum_{i=1}^{t-1} x_i \Bigr)$$

The functions $l_i(w) - m_i(w) - w \cdot x_i$ also achieve their minima, since
$0 \in \partial (l_i - m_i)(\omega_t) - x_i$. But $L_t$ is the sum of $M$,
$l_i(w) - m_i(w) - w \cdot x_i$, and $m_i(w)$ for $i = 1, \ldots, t - 1$. So,
since the sum of closed convex minimum-achieving functions is also a closed
convex minimum-achieving function, $L_t$ achieves its minimum.
□


3.3.2  One-step  regret 

Our potential function will be a generalized Bregman divergence involving the
comparison vector $u$, the loss functions $l_t$, and the MAP algorithm's current
prediction $w_t$. The reason we use a divergence involving $u$ and $w_t$ is that
we want to prove that, on trials where the MAP algorithm suffers a large regret
compared to $u$, it will move its next prediction closer to $u$. That way, we
can conclude that if it sees the same loss function again, it will incur a
smaller regret.

Let $L_t = \sum_{i=0}^{t-1} l_i$. The prediction $w_t$ chosen by the MAP
algorithm will be in $\arg\min_w L_t(w)$. We define our potential function to be
$\mathbb{D}_{L_t}(u \mid 0)$. The potential change on each time step is given by
the following lemma.

Lemma 3.2 On trial $t$, the change in potential is

$$\mathbb{D}_{L_{t+1}}(u \mid 0) - \mathbb{D}_{L_t}(u \mid 0) = l_t(u) - L_t^*(0) + L_{t+1}^*(0)$$

Proof: The potential on step $t$ is

$$\mathbb{D}_{L_t}(u \mid 0) = L_t(u) + L_t^*(0)$$

So, the difference in potential from trial $t$ to $t + 1$ is

$$(L_{t+1} - L_t)(u) + L_{t+1}^*(0) - L_t^*(0)$$

But $L_{t+1} - L_t$ is just $l_t$, so the result follows. □

The function $L_t^*$ is important, since it encodes both the best possible loss
so far and the MAP algorithm's next prediction: Theorem 27.1 in [Roc70] states
that $L_t^*(0) = -L_t(w_t)$ and $w_t \in \partial L_t^*(0)$. Most of the work in
applying Theorem 3.1 to specific problems will come in analyzing $L_t^*$. For
example, in the Weighted Majority proof below, $L_t^*(0)$ will be the log of the
sum of the unnormalized weights, and the main part of the proof will be to
connect the change in this quantity to the algorithm's loss.

3.3.3  Amortized  analysis 

In order to complete the proof of our bound, we need to relate the quantity
$L_t^*(0) - L_{t+1}^*(0)$ to the loss of the MAP algorithm. Since the
relationship depends on the type of loss function we are using, for now we will
just assume that there are constants $c_1 \ge c_2 \ge \ldots \ge 0$ so that

$$c_t \bigl( L_t^*(0) - L_{t+1}^*(0) \bigr) \ge \hat{l}_t(w_t) \qquad (3.1)$$

Here $\hat{l}_t$ is some function related to $l_t$. Often we will just use
$\hat{l}_t = l_t$, but we will sometimes need the extra generality. The smaller
we take $c_t$, the better our bounds will be.

We can think of $1/c_t$ as a lower bound on how much information is available
to the algorithm at the beginning of trial $t$. The best allowable value of
$c_t$ will depend on how convex $L_t$ is when compared to $l_t$. For example, if
every $l_t$ is quadratic with the same second derivative, we will show below
that we can take $1/c_t$ proportional to the sample size $t$.

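With the assumption (3.1), Lemma 3.2 becomes

$$\mathbb{D}_{L_t}(u \mid 0) - \mathbb{D}_{L_{t+1}}(u \mid 0) \ge \frac{1}{c_t} \hat{l}_t(w_t) - l_t(u)$$

or

$$\hat{l}_t(w_t) \le c_t l_t(u) + c_t \mathbb{D}_{L_t}(u \mid 0) - c_t \mathbb{D}_{L_{t+1}}(u \mid 0) \qquad (3.2)$$

If we now apply Lemma 3.2 to trial $t + 1$, we get

$$\hat{l}_{t+1}(w_{t+1}) \le c_{t+1} l_{t+1}(u) + c_{t+1} \mathbb{D}_{L_{t+1}}(u \mid 0) - c_{t+1} \mathbb{D}_{L_{t+2}}(u \mid 0) \qquad (3.3)$$

Notice that $\mathbb{D}_{L_{t+1}}(u \mid 0)$ appears both in Equation 3.2 and in
Equation 3.3, once with coefficient $-c_t$ and once with coefficient $c_{t+1}$.
Since $c_{t+1} \le c_t$ and since Bregman divergences are nonnegative, the two
terms together are less than or equal to zero; so, we can drop them from our
bound on total regret.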

But now we have proven

Theorem 3.1 Let $l_0, l_1, \ldots$ satisfy the assumptions of Lemma 3.1, so that
the MAP algorithm produces a prediction $w_t$ at trial $t$. Define
$L_t = \sum_{i=0}^{t-1} l_i$. Let the constants $c_t$ and the functions
$\hat{l}_t$ be such that
$c_t \bigl( L_t^*(0) - L_{t+1}^*(0) \bigr) \ge \hat{l}_t(w_t)$.
Then for all $u$ the total regret of the MAP algorithm is bounded by

$$\sum_{t=1}^{T} \hat{l}_t(w_t) \le \sum_{t=1}^{T} c_t l_t(u) + c_1 \mathbb{D}_{l_0}(u \mid 0)$$

Proof: Sum Lemma 3.2 over all trials, then cancel terms as described above.
Finally, ignore the term containing the ending potential
$\mathbb{D}_{L_{T+1}}(u \mid 0)$. □


3.3.4  Specific  bounds 

All  that  remains  is  to  evaluate  the  constants  ct  for  specific  types  of  loss  functions. 
In  the  following  sections  we  will  do  just  that.  The  next  two  sections  analyze  the 
Weighted  Majority  algorithm.  Theorem  3.2,  proved  in  Section  3.7,  covers  cases 
in  which  the  one-step  losses  can  be  represented  as  Bregman  divergences.  In 
particular,  Sections  3.6  and  3.8  cover  generalized  gradient  descent  algorithms, 
Section 3.9 covers inference in exponential families, and Sections 3.10 and 3.11
cover  generalized  linear  regression  algorithms  including  linear  regression  and 
exponentiated  gradient. 


3.4  Weighted  Majority 

One  of  the  simplest  MAP  algorithms  is  Weighted  Majority,  described  in  [LW92]. 
Here  we  will  analyze  the  versions  which  are  called  WMR  (for  randomized)  and 
WMC  (for  continuous)  in  that  paper. 

WM is designed for a problem called “learning from expert advice.” In this
problem, the learner must choose one of $N$ alternatives on each trial (say,
which of $N$ football games to bet a predetermined amount on). We will represent
this decision with a vector $w_t$ in the unit simplex
$P = \{ w \in \mathbb{R}^N \mid w_i \ge 0 \wedge \sum_i w_i = 1 \}$.
Picking one of the corners of the simplex means betting on the corresponding
game. Picking a vector in the middle means either choosing a game to bet
on at random (with probabilities $w_t$) or splitting the bet among the games
(with proportions $w_t$). These two interpretations yield the WMR and WMC
algorithms respectively. Since this is the only difference between WMR and
WMC, we will analyze both algorithms together and use WM to refer to either
one.

In either WMR or WMC, the learner then finds out which bets paid off and
receives a loss of $w_t \cdot x_t$, where $x_{t,i}$ is the loss for betting on
the $i$th game. (In WMC, the loss is deterministic, while in WMR, $w_t \cdot x_t$
is the expected loss. The expectation is over the learner's randomization.) For
notational convenience, we will assume that $0 \le x_{t,i} \le 1$. We assume
that the learner has no outside information beyond the history of losses.

The above description of the expert advice problem is a little more general
than the version in [LW92]. That paper assumes that the learner is trying to
solve a classification problem. There are $N$ experts who claim to know the
answer. The $i$th decision corresponds to agreeing with the $i$th expert, and
$x_{t,i}$ is the prediction error of the $i$th expert. This version of learning
from expert advice is a simple example of a regression problem (see below).

To solve the expert advice problem, WM follows a simple strategy. Whenever
an expert makes a mistake (i.e., has a loss of 1), WM reduces that expert's
weight by a constant factor $\beta \in (0, 1)$, then renormalizes to keep the
sum of the weights equal to 1. Experts with losses less than 1 have their
weights reduced less. More specifically, define $X_t = \sum_{i=1}^{t-1} x_i$.
Write $v_{t,i} = \beta^{X_{t,i}}$. Let $Z_t = 1 / \sum_i v_{t,i}$. Then WM
predicts $w_t = Z_t v_t$. (Actually, [LW92] allows some flexibility in choosing
$X_t$, but this is one of the allowed choices.)

To design a MAP algorithm for learning from expert advice, we just need
to pick a prior loss function $l_0$, since we already know
$l_t(w) = w \cdot x_t$ for $t \ge 1$. In order to make sure that our predictions
are always in the unit simplex $P$, we will set $l_0(w) = \infty$ for
$w \notin P$. A reasonable choice of $l_0$ for $w \in P$ is some multiple of the
entropy function, making

$$l_0(w) \propto H(w)$$

$$H(w) = \delta(w \mid P) + \sum_i w_i \ln w_i \qquad (3.4)$$

It is easy to verify that

$$H^*(x) = \ln \sum_i \exp(x_i)$$

$$\frac{\partial}{\partial x_i} H^*(x) = \exp(x_i) \Big/ \sum_j \exp(x_j)$$

To duplicate WM, we will pick $l_0 = \frac{-1}{\ln\beta} H(w)$. (This choice of
$l_0$ means that $w_1$ will be at the center of $P$; it is easy to accommodate
other starting vectors by adding a linear function to $l_0$ to move its minimum
to the desired $w_1$.) Then

$$l_0^*(x) = \frac{-1}{\ln\beta} H^*(-x \ln\beta)$$

$$(l_0^*)'(x) = (H^*)'(-x \ln\beta)$$

Furthermore, since $L_t(w) = l_0(w) + X_t \cdot w$, we have
$L_t^*(x) = l_0^*(x - X_t)$. That means that our prediction on step $t$ will be
$w_t = (l_0^*)'(-X_t) = (H^*)'(X_t \ln\beta)$, or

$$w_{t,i} = \frac{v_{t,i}}{\sum_j v_{t,j}}, \qquad v_{t,i} = \beta^{X_{t,i}}$$

which is identical to the prediction of WM.
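
The equivalence just derived can be illustrated with a short sketch that
computes the weights both ways: by the multiplicative update with
renormalization, and by the closed-form MAP prediction (a softmax of
$X_t \ln\beta$). The function names and the loss sequence are hypothetical and
chosen only for this example.

```python
import math


def wm_multiplicative(losses, beta):
    """Multiply each weight by beta**x_{t,i} and renormalize after every trial."""
    n = len(losses[0])
    w = [1.0 / n] * n
    history = [w]
    for x in losses:
        v = [wi * beta ** xi for wi, xi in zip(w, x)]
        total = sum(v)
        w = [vi / total for vi in v]
        history.append(w)
    return history


def softmax(z):
    m = max(z)                               # shift for numerical stability
    e = [math.exp(zi - m) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]


def wm_map(losses, beta):
    """MAP form: w_t is the gradient of H* at X_t ln(beta), with X_t the summed losses."""
    n = len(losses[0])
    X = [0.0] * n
    history = [softmax([Xi * math.log(beta) for Xi in X])]
    for x in losses:
        X = [Xi + xi for Xi, xi in zip(X, x)]
        history.append(softmax([Xi * math.log(beta) for Xi in X]))
    return history


losses = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0]]   # hypothetical
mult = wm_multiplicative(losses, beta=0.5)
closed = wm_map(losses, beta=0.5)
max_diff = max(abs(a - b) for wa, wb in zip(mult, closed) for a, b in zip(wa, wb))
```

The two weight sequences agree up to rounding, so `max_diff` is on the order of
machine precision.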
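
Now that we have expressed WM as a MAP algorithm, we can analyze it
by applying Theorem 3.1. To do so, we must compute the constants $c_t$. It is
easy to see that taking $c_t = -\ln\beta / (1 - \beta)$ for all $t$ satisfies
the assumptions of Theorem 3.1, since we can write

$$\begin{aligned}
H^*(X_{t+1} \ln\beta) - H^*(X_t \ln\beta)
  &= \ln \frac{\sum_i \beta^{X_{t+1,i}}}{\sum_i \beta^{X_{t,i}}} \\
  &= \ln \sum_i \beta^{x_{t,i}} w_{t,i} \\
  &\le \ln \sum_i (1 - (1 - \beta) x_{t,i}) w_{t,i} \\
  &= \ln (1 - (1 - \beta) x_t \cdot w_t) \\
  &\le -(1 - \beta) x_t \cdot w_t
\end{aligned}$$

The first inequality holds because $\beta^x \le 1 - (1 - \beta) x$ for
$\beta > 0$ and $x \in [0, 1]$, while the second holds because
$\ln(1 - x) \le -x$. Since $L_t^*(0) = l_0^*(-X_t) = \frac{-1}{\ln\beta}
H^*(X_t \ln\beta)$, multiplying the chain by $-1/\ln\beta > 0$ gives
$c_t (L_t^*(0) - L_{t+1}^*(0)) \ge x_t \cdot w_t$. So now we have proven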


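Corollary 3.1 The loss of WM with parameter $\beta$ is bounded by

$$\sum_t x_t \cdot w_t \le \frac{-\ln\beta}{1 - \beta} \sum_t x_t \cdot u + \frac{1}{1 - \beta} \mathbb{D}_H(u \mid 0)$$

Proof: Apply Theorem 3.1 with $\hat{l}_t = l_t$ and
$c_t = -\ln\beta / (1 - \beta)$. Then replace $l_0$ by
$\frac{-1}{\ln\beta} H$. □

If we now note that $\mathbb{D}_H(u \mid 0) \le \ln N$ for all $u \in P$, the
above result is identical to Corollary 6.1 in [LW92].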
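
The resulting bound, with the comparison vector taken to be the best single
expert and the divergence replaced by $\ln N$, can be checked on a concrete
run. The sketch below is an illustration under assumptions: the loss sequence
is hypothetical, and the function names are invented for this example.

```python
import math


def wm_total_loss(losses, beta):
    """Run WM and return (total loss sum_t x_t . w_t, cumulative loss per expert)."""
    n = len(losses[0])
    X = [0.0] * n                            # cumulative losses before each trial
    total = 0.0
    for x in losses:
        v = [beta ** Xi for Xi in X]         # unnormalized weights
        s = sum(v)
        w = [vi / s for vi in v]
        total += sum(wi * xi for wi, xi in zip(w, x))
        X = [Xi + xi for Xi, xi in zip(X, x)]
    return total, X


def wm_bound(best_loss, n, beta):
    """(ln(1/beta) * best_loss + ln n) / (1 - beta), the bound for the best expert."""
    return (math.log(1.0 / beta) * best_loss + math.log(n)) / (1.0 - beta)


# hypothetical losses in [0, 1] for N = 3 experts over 30 trials
losses = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0]] * 10
beta = 0.5
total, X = wm_total_loss(losses, beta)
bound = wm_bound(min(X), n=3, beta=beta)
```

On any loss sequence with entries in $[0, 1]$ the value of `total` should not
exceed `bound`; trying several values of `beta` shows how the factor
$-\ln\beta/(1 - \beta)$ on the best expert's loss trades off against the
$\ln N / (1 - \beta)$ term.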

3.5  Log  loss 

In step $t$ of Weighted Majority the learner is charged the loss
$x_t \cdot w_t$, where $x_{t,i}$ is the loss of the $i$th expert. For some
problems it may be more appropriate to use the loss function
$l_t(w) = -\ln(y_t \cdot w)$ for some vector $y_t$ instead. Two examples are the
portfolio selection problem and the mixture estimation problem.

In the portfolio selection problem, the learner is presented with $N$
investments on each time step. After the learner chooses what fraction of its
fortune to invest in each alternative, investment $i$ grows by a factor of
$y_{t,i}$. So, if the learner puts a fraction $w_{t,i}$ in each investment, its
total wealth grows by a factor of $w_t \cdot y_t$. Since, in our framework, we
combine losses from different trials by adding them, we need to take the log of
the wealth changes. That way the total of the log wealth changes will be the log
of the total wealth change. Since losses are the negative of gains, that leaves
us with the penalty $-\ln(w_t \cdot y_t)$.

In the mixture estimation problem, the learner must discover the coefficients
in a mixture of $N$ probability distributions. After choosing mixture
coefficients $w_t$, the learner receives a new training example and computes the
probability $y_{t,i}$ assigned by the $i$th probability distribution to the new
example. Since we want to maximize likelihood, or equivalently minimize negative
log likelihood, we charge the learner a loss of $-\ln(w_t \cdot y_t)$.

If we write $x_{t,i} = -\ln y_{t,i}$ we can run WM with the vectors $x_t$. In
other words, we can compute $w_t$ according to the equations

$$X_t = \sum_{j=1}^{t-1} x_j, \qquad v_{t,i} = \beta^{X_{t,i}} = \exp(-\eta X_{t,i}), \qquad w_{t,i} = \frac{v_{t,i}}{\sum_{j=1}^{N} v_{t,j}}$$

where $\eta = -\ln\beta$ is the learning rate.

We will call the resulting algorithm WM-log, even though it has the exact same
series of computational steps as WM, to emphasize that we want to prove bounds
on its log loss $-\sum_t \ln(w_t \cdot y_t)$. Just as before, we will assume
that $x_{t,i} \in [0, 1]$ for notational convenience. It turns out that WM-log
is a special case of the Aggregating Algorithm of [Vov90].

The WM-log update has a particularly simple interpretation in the portfolio
selection problem. If we let $\eta = 1$ (so that $\beta = 1/e$), then the
fraction of money in the $i$th investment at step $t$ is
$\exp(-X_{t,i}) / \sum_j \exp(-X_{t,j})$. The rate of growth on step $t$ is
$w_t \cdot y_t = \sum_i \exp(-X_{t+1,i}) / \sum_i \exp(-X_{t,i})$, so we can
prove by induction that a fortune of \$$N$ on step 1 grows to a fortune of
$\sum_i \exp(-X_{t,i})$ dollars on step $t$. So, the amount of money in the
$i$th investment on step $t$ is just $\exp(-X_{t,i})$ dollars. But this is
exactly the amount which would be in the $i$th investment if we had just
invested \$1 in each investment on step 1 and let it sit. And, in fact, the
bound which we will prove below is equivalent to the obvious observation that
investing \$1 in each investment earns at least $1/N$ times as much as investing
\$$N$ in the best investment.
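
The buy-and-hold interpretation can be illustrated with a small simulation.
The sketch below assumes $\eta = 1$ as in the paragraph above; the growth
factors are hypothetical, and the function names are invented for this
example. The identity being illustrated does not depend on the growth factors
lying in any particular range, only on their being positive.

```python
import math


def wm_log_wealth(growth, start):
    """Rebalance to WM-log weights (eta = 1) each step; return the final wealth."""
    n = len(growth[0])
    X = [0.0] * n                      # X_{t,i} = -sum_{j<t} ln y_{j,i}
    wealth = start
    for y in growth:
        v = [math.exp(-Xi) for Xi in X]
        s = sum(v)
        w = [vi / s for vi in v]       # fraction of wealth in each investment
        wealth *= sum(wi * yi for wi, yi in zip(w, y))
        X = [Xi - math.log(yi) for Xi, yi in zip(X, y)]
    return wealth


def buy_and_hold(growth):
    """Put $1 in each investment at step 1 and never rebalance; return holdings."""
    holdings = [1.0] * len(growth[0])
    for y in growth:
        holdings = [h * yi for h, yi in zip(holdings, y)]
    return holdings


# hypothetical per-step growth factors y_{t,i} for N = 3 investments
growth = [[1.1, 0.9, 1.0], [0.8, 1.2, 1.05], [1.3, 0.7, 1.0]]
n = len(growth[0])
wm_final = wm_log_wealth(growth, start=float(n))
holdings = buy_and_hold(growth)
best_single = n * max(holdings)        # $N placed entirely in the best investment
```

Here `wm_final` matches `sum(holdings)` up to rounding, and it is at least
`best_single / n`, which is the observation the text says the bound is
equivalent to.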

To analyze the WM-log algorithm we will compare our performance, not
to the best vector $u \in P$, but only to the best individual expert (i.e., the
best corner of $P$). Write $\bar{P}$ for the set of corners of $P$. Then if we
write $m_t(w) = w \cdot x_t$,

$$l_t(w) - l_t(u) \le m_t(w) - m_t(u) \qquad \forall u \in \bar{P},\; w \in P$$

since $m_t$ touches $l_t$ at each corner of $P$ but lies above $l_t$ elsewhere
in $P$.

If we now run the MAP algorithm with loss functions $H, m_1, m_2, \ldots$ then
the analysis of Section 3.4 shows that our predictions will be identical to
WM-log with learning rate $\eta = 1$. Furthermore, with
$L_t = H + \sum_{i=1}^{t-1} m_i$, we have

$$L_{t+1}^*(0) - L_t^*(0) = \ln \sum_i \exp(-x_{t,i})\, w_{t,i} = \ln (y_t \cdot w_t) = -l_t(w_t)$$

So, we can apply Theorem 3.1 to the loss functions $H, m_1, m_2, \ldots$ with
$c_t = 1$ and $\hat{l}_t = l_t$. The result is that

$$\sum_{t=1}^{T} l_t(w_t) \le \sum_{t=1}^{T} m_t(u) + \mathbb{D}_H(u \mid 0)$$

for every corner $u$ of $P$. With the substitutions $m_t(u) = l_t(u)$ and
$\mathbb{D}_H(u \mid 0) = \ln N$, this becomes

$$\sum_{t=1}^{T} l_t(w_t) \le \sum_{t=1}^{T} l_t(u) + \ln N$$

This bound is equivalent to Equation (3.4) in [HKW98]. (That equation refers
to a constant $c_L$, which plays the same role there that $1/\eta$ does here,
and which is set to 1 for the analysis of the WM-log algorithm.)

So,  for  the  WM  and  WM-log  algorithms,  our  regret  bounds  are  the  same 
as  the  bounds  previously  obtained  in  the  literature.  As  we  would  hope  for  a 
general  framework  for  regret  bounds,  once  we  set  up  WM  and  WM-log  as  MAP 
algorithms, their proofs are similar: we evaluate the constant $c = c_t$ and apply
Theorem  3.1.  We  can  follow  a  similar  strategy  for  the  other  MAP  algorithms 
described  below.  Since  some  of  these  proofs  are  more  complicated,  we  will  collect 
some  of  the  overlap  into  Theorems  3.2  and  3.3. 


3.6  Generalized  gradient  descent 

In the previous two sections we analyzed simple MAP algorithms in which all
of the loss functions except the prior were linear. In the first, the loss functions
started out linear, while in the second, we bounded the true loss functions by
a linear approximation. Because of the linearity of the loss functions, it was
easy to compute the prediction $w_t$ on each time step: the update rules for WM
and WM-log are both of the form $w_t = f(-\eta X_t)$, where $X_t$ is the sum of the
gradients of the previous loss functions and $f$ is a function that we can compute
efficiently.

We would like to be able to play the same trick for an arbitrary convex loss
function $l_t$. That is, we would like to bound $l_t$ by a linear function $m_t$, then
apply the MAP algorithm to the functions $m_t$ instead of $l_t$ so that it will run
more efficiently. Of course, the predictions will be different if we use $m_t$ in place
of $l_t$, and so the regret may be larger. But, we may have to do significantly less
work per trial, and we will still be able to bound the regret.

The key inequalities which allowed us to replace $l_t$ by $m_t$ in the previous
section were

$$l_t(w_t) \le m_t(w_t)$$
$$l_t(u) \ge m_t(u) \qquad \forall u \in \mathcal{U} \qquad (3.5)$$

If $\mathcal{U} = W$ then these inequalities force $m_t$ to be tangent to $l_t$ at $w_t$; if
$\mathcal{U} \subset W$ then $m_t$ may be a secant to $l_t$ that passes above $(w_t, l_t(w_t))$.
Subtracting the second inequality from the first gives
$l_t(w_t) - l_t(u) \le m_t(w_t) - m_t(u)$ for all $u \in \mathcal{U}$, so
that when we apply Theorem 3.1 to bound the difference $m_t(w_t) - m_t(u)$ we
also get a bound on the regret $l_t(w_t) - l_t(u)$.

Name                    Link (l_0^*)'(a)           Loss l_0(w)
Identity                a                          w^2/2
Logistic                1/(1 + exp(-a))            w ln w + (1 - w) ln(1 - w)
Inverse logistic        ln(a/(1 - a))              ln(1 + exp(w))
Exponential             exp a                      w ln w - w
Logarithmic             ln a                       exp w
Normalized exponential  exp a_i / sum_j exp a_j    sum_i w_i ln w_i - 1 + delta(w | sum_i w_i = 1)

Figure 3.4: Some examples of link functions.


In the previous section we achieved Equation 3.5 by restricting $\mathcal{U}$ to the
corners of the unit simplex, even though $w_t$ was allowed to range over the entire
simplex. In general we want to set $\mathcal{U}$ to the range of $w_t$, and in this case the
only suitable linear functions $m_t$ are those which are tangent to $l_t$ at $w_t$.

If we set $m_t$ to be a tangent to $l_t$ at $w_t$,
$m_t(w) = l_t(w_t) + (w - w_t) \cdot l_t'(w_t)$, and
then feed the sequence of loss functions $l_0, m_1, m_2, \ldots$ to the MAP algorithm,
the result is an algorithm called generalized gradient descent or GGD. It is
"generalized" because, when $l_0$ is quadratic, the update rule reduces to ordinary
gradient descent. We can write the GGD update rule as follows:

GGD Algorithm: Predict $w_t \in \arg\min_w \bigl( l_0(w) + w \cdot \sum_{i=1}^{t-1} l_i'(w_i) \bigr)$.

The GGD algorithm is often written in an additive form that looks different
from its statement above. If we write $X_t = \sum_{i=1}^{t-1} l_i'(w_i)$ then the additive form
of the GGD prediction rule is $w_t = f(-\eta X_t)$. Here $\eta$ is a learning rate and
$f$ is a function from $\mathbb{R}^n$ to $\mathbb{R}^n$ satisfying appropriate conditions. For example,
choosing $f$ to be the identity yields ordinary gradient descent. The advantage
of this form of the prediction rule comes from the fact it may be difficult to
compute $l_0$ from $f$, while it is often easier to compute $f$ from $l_0$; so, if we are
given $f$, we can use the additive form of the GGD rule without needing to
compute $l_0$.

We can prove that the two forms of the GGD algorithm are equivalent: if
$\eta = 1$, then we can set $f = (l_0^*)'$. For different learning rates we can just multiply
$l_0$ by a constant, since $(\frac{1}{\eta} l_0)^*(x) = \frac{1}{\eta} l_0^*(\eta x)$ and so
$((\frac{1}{\eta} l_0)^*)'(x) = (l_0^*)'(\eta x)$.

The function $f(x)$ (or equivalently $(l_0^*)'(x)$) is called a link function.
Figure 3.4 shows some useful link functions and their corresponding loss functions.
The one-dimensional link functions in Figure 3.4 can easily be generalized to
multiple dimensions by applying them separately to each coordinate.
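
To make the additive form concrete, here is a minimal sketch of the rule $w_t = f(-\eta X_t)$ with two of the links from Figure 3.4. It is an illustration under stated assumptions rather than the thesis's implementation: the helper names, the fixed target `z`, and the loss $l_t(w) = \frac12\|w - z\|^2$ are all hypothetical choices.

```python
import math

def identity_link(a):
    # Identity link: GGD reduces to ordinary gradient descent.
    return list(a)

def normalized_exponential_link(a):
    # Normalized exponential link: predictions stay on the unit simplex.
    m = max(a)                       # subtract the max for numerical stability
    e = [math.exp(ai - m) for ai in a]
    s = sum(e)
    return [ei / s for ei in e]

def ggd_predictions(grad, link, eta, steps, dim):
    """Additive-form GGD: w_t = link(-eta * X_t), X_t = sum of past gradients."""
    X = [0.0] * dim
    history = []
    for _ in range(steps):
        w = link([-eta * xi for xi in X])
        history.append(w)
        g = grad(w)                  # gradient l_t'(w_t) of the current loss
        X = [xi + gi for xi, gi in zip(X, g)]
    return history

# Hypothetical one-step loss l_t(w) = 0.5 * ||w - z||^2 with a fixed target z.
z = [0.2, 0.5, 0.3]
grad = lambda w: [wi - zi for wi, zi in zip(w, z)]

gd = ggd_predictions(grad, identity_link, 0.1, 200, 3)
eg = ggd_predictions(grad, normalized_exponential_link, 0.1, 200, 3)
```

With the identity link each prediction differs from the previous one by $-\eta$ times the latest gradient, which is the usual gradient descent step started from $w_1 = 0$; with the normalized exponential link the same accumulated gradients give multiplicative-style updates on the simplex.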

Some examples of GGD algorithms are ordinary gradient descent, the perceptron
learning rule, and the Exponentiated Gradient algorithm of [KW97].
We will examine some of these algorithms in more detail below. But first, we
will prove regret bounds for a class of algorithms that includes GGD.

3.7  General  regret  bounds 

3.7.1  Preliminaries 

In many common MAP algorithms, each individual loss function can be written
as a Bregman divergence. For example, in linear regression, the loss functions
are of the form $(y_t - w_t \cdot x_t)^2$, which we may think of as a scaled Euclidean
distance between $w_t$ and any of the infinitely many perfect predictions $w$
satisfying $y_t = w \cdot x_t$. (The scaling is such that all directions perpendicular to
$x_t$ have weight zero.) For a more general example, in GGD, if we adopt the
convention that $\inf_w l_t(w) = 0$, then the loss $m_t(w_t)$ is $l_t(w_t) = D_{l_t^*}(0|w_t)$. Or,
for another example, in inference of the natural parameter in an exponential
family, we will see below that the appropriate loss function is $D_l(w_t|a_t)$ for a
fixed $l$. In this section we will derive regret bounds that hold when the loss
functions are divergences.

To that end, assume that we are running the MAP algorithm with loss
functions $l_0, m_1, m_2, \ldots$, and that $m_t(w_t) = D_{l_t}(w_t|a_t)$. Also assume
$m_t(w) \le D_{l_t}(w|a_t)$ for all $w$. (These inequalities are a tangency condition similar to (3.5).)
Write $L_t = l_0 + \sum_{i=1}^{t-1} m_i$. This notation is similar to the notation from the
section on GGD, but in this section we are not assuming that the functions $m_t$
are linear. In particular, we may take $m_t = l_t$.

In order to bound the loss of the MAP algorithm, we have to make sure that
the prior loss $L_t$ before each trial $t$ is sufficiently convex. To see why, consider
what would happen if we took $l_0 = L_1$ to be $\delta(w|[0,1])$. With this choice of
prior loss, our predicted $w$ can change discontinuously from 0 to 1 even when
the one-step loss has only a small gradient. So, for example, if we see $m_1 = w/2$
and then $m_2, m_3, \ldots = (1 - w), w, (1 - w), w, \ldots$ our predictions will alternate
between 0 and 1 no matter how small $\eta$ is. In fact, we will always choose the
worst possible $w$, and so our loss will be twice that of the comparison vector
$u = .5$.
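
The alternation is easy to reproduce. The sketch below simulates the MAP algorithm for this one example only, assuming each one-step loss is linear, $m_t(w) = a w + b$, so that the minimizer of the accumulated loss over $[0, 1]$ is always an endpoint. The function name and the choice of $w_1 = 0$ (any point of $[0, 1]$ minimizes the prior alone) are illustrative assumptions.

```python
def map_on_interval(linear_losses):
    """MAP predictions when l_0 is the indicator of [0, 1].

    linear_losses is a list of (a, b) pairs describing m_t(w) = a*w + b.
    Returns the predictions w_t and the losses m_t(w_t).
    """
    slope_sum = 0.0              # slope of the accumulated loss L_t on [0, 1]
    preds, losses = [], []
    for t, (a, b) in enumerate(linear_losses):
        if t == 0:
            w = 0.0              # arbitrary: every w in [0, 1] minimizes l_0
        else:
            # A linear function on [0, 1] is minimized at an endpoint.
            w = 0.0 if slope_sum > 0 else 1.0
        preds.append(w)
        losses.append(a * w + b)
        slope_sum += a
    return preds, losses

# m_1 = w/2, then (1 - w), w, (1 - w), w, ...
seq = [(0.5, 0.0)] + [(-1.0, 1.0), (1.0, 0.0)] * 5
preds, losses = map_on_interval(seq)
map_loss = sum(losses[1:])                          # loss on the alternating part
comparison_loss = sum(a * 0.5 + b for a, b in seq[1:])   # loss of u = .5
```

On the alternating part of the sequence the MAP prediction always lands on the endpoint that the next loss penalizes, so `map_loss` is exactly twice `comparison_loss`.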

We also have to make sure that the one-step divergence functions $l_t$ for
$t \ge 1$ are not too convex. If they are, we can cause the MAP algorithm to
suffer an arbitrarily large regret per trial: the more convex $l_t$ is as compared to
$L_t$, the more of an advantage it is to pick the comparison vector after knowing
mt.  For  example,  if  /o(^^)  =  (so  that  tyj  =  0),  then  the  loss  function 
mi{w)  =  10^ {w  —  1)^  will  cause  the  MAP  algorithm  a  loss  of  10^,  while  the 
optimal  comparison  vector  will  suffer  a  loss  of  approximately  even 

though  its  /o’-divergence  from  wi  is  less  than  1. 

So, to ensure that $L_t$ is sufficiently convex, we will pick a gauge $g$ and
constants $\eta_t \in (0,1)$ and require that

$$\eta_t D_{L_t}(v|a) \ge \frac{1}{2}(g(v - w))^2$$

for all $v$ and $w$ and $a \in \partial L_t(w)$. And, to ensure that $l_t$ is not too convex, we
will require that

$$D_{l_t^*}(a_t|w) \ge \frac{1}{2}(g^\circ(a_t - a))^2$$

for all $w$ and $a \in \partial l_t(w)$.

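A consequence of the first assumption is that

$$L_t(w) - L_t(w_t) \ge \frac{1}{2\eta_t}(g(w - w_t))^2$$

since the LHS is equal to $D_{L_t}(w|0)$ and $0 \in \partial L_t(w_t)$. A consequence of the
second assumption is that

$$m_t(w_t) \ge \frac{1}{2}(g^\circ(-m_t'(w_t)))^2$$

as long as $\partial m_t(w_t)$ is nonempty, since

$$m_t(w_t) = D_{l_t}(w_t|a_t) = D_{l_t^*}(a_t|w_t) \ge \frac{1}{2}(g^\circ(a_t - a))^2$$

for any $a \in \partial l_t(w_t)$, and since $\partial l_t(w_t) - a_t \supseteq \partial m_t(w_t)$.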

Scaling the gauge $g$ will scale $\eta_t$ inversely. So, in order to make the constant
$\eta_t$ as small as possible in the first assumption, it is important to take $g$ to be as
shallow as possible while still satisfying the second assumption.

3.7.2  Examples 

To interpret our assumptions, it will help to compute the best gauge $g$ and
learning rate $\eta$ for some examples. First suppose that $L_t$ and $l_t$ are both quadratic,
say $L_t(w) = \frac{k}{2} w^T M w$ and $l_t(w) = \frac{1}{2} w^T M w$ for some symmetric positive
definite matrix $M$. (This choice of $l_t$ means that
$m_t(w_t) = \frac{1}{2}(w_t - z_t)^T M (w_t - z_t)$,
where $z_t = M^{-1} a_t$.) Then we can choose $\eta_t = \frac{1}{k}$ and $g(w) = \sqrt{w^T M w}$, since

$$\eta_t D_{L_t}(v|a) = \frac{1}{2}(v - w)^T M (v - w) = \frac{1}{2} g(v - w)^2$$

where $w = (kM)^{-1} a$, and

$$D_{l_t^*}(a_t|w) = \frac{1}{2} a_t^T M^{-1} a_t + \frac{1}{2} w^T M w - a_t \cdot w = \frac{1}{2} g^\circ(a_t - a)^2$$

where $a = Mw$.

Or suppose that $l_t$ is quadratic but $L_t$ is proportional to the entropy function
$H$ defined in Equation 3.4. In particular, let

ki'^)  =  llMl 

It  is  well  known  that  Dh{v\w)  >  2\\v  —  w\\2  for  any  v,w.  So,  MDi,iv\w)  > 
|||i?  —  w\\2-  And  just  as  in  the  previous  example  P/*  {at\w)  >  ^\\at  —  wWl^  So, 
we  can  choose  g  to  be  Euclidean  distance  and  let  rjt  = 

These two examples show that $g$ and $\eta$ together provide a global analog to
the Fisher information matrix. When the Fisher information $L''(w)$ is constant
over all possible parameter values $w$, as it is in the first example, the local and
global  information  measures  are  the  same.  On  the  other  hand,  when  the  Fisher 
information  varies,  as  it  does  in  the  second  example,  the  global  measure  may 
be  much  more  conservative.  This  conservatism  is  necessary:  in  the  average  case 
we  can  count  on  having  our  estimates  stay  near  the  optimal  value,  while  in  the 
worst  case  our  opponent  can  cause  our  estimates  to  wander  into  a  region  with 
lower  information. 

Finally,  suppose  that  lt{w)  =  and  let  =  0  so  that  mt{wt)  = 

This  choice  of  loss  function  is  appropriate  for  linear  regression  problems. 
It  depends  on  w  only  through  w  •  Xt-,  so  any  change  in  w  perpendicular  to 
xt  leaves  k  constant.  That  means  that  we  can  represent  as  the  sum  of  two 

components,  one  of  which  depends  only  onw^xt  and  the  other  of  which  depends 
def 

only  on  w\xt  =  w  -  A  little  algebra  shows 


ll{x)  =  (5(x|a;  \a;t  =  0)  +  ^  ^ 

Xt  ’  Xf  L 

In other words, $l_t^*$ is infinite everywhere except along the line through $x_t$, and
along that line it is quadratic. The quadratic term (the last term in the
expression above) is scaled so that it is equal to $\frac{1}{2}$ at $x_t$ and $-x_t$. So, to bound it we
will need to make some assumption about $x_t$.

If we suppose that the gauge $g$ is symmetric and scaled so that $g^\circ(x_t) \le 1$,
then it is not hard to see that $D_{l_t^*}(0|w) = l_t(w) \ge \frac{1}{2}(g^\circ(x))^2$, since the latter
expression is also quadratic along the line through $x_t$ and scaled so that it is no
larger than $\frac{1}{2}$ at $\pm x_t$. So, for example, if $\|x_t\|_\infty \le K$, we can take $g(w)$ to be
$K\|w\|_1$.

Now,  since  ||tt?||i  <  ||u^|l2j  we  have  Dh(v\w)  >  2\\v  —  'w\\l.  So,  if  Lt{w)  = 
fcPif  (t/;|0),  we  can  take  gt  == 


3.7.3  The  bound 

We  will  now  prove  our  regret  bound. 

Theorem 3.2 Suppose that the loss functions $l_0, m_1, m_2, \ldots$ satisfy the
conditions of Lemma 3.1, so that the MAP algorithm applied to these loss
functions always produces a prediction $w_t$ at each trial. Suppose that for all $t$,
$\partial m_t(w_t)$ is nonempty, $m_t(w_t) = D_{l_t}(w_t|a_t)$, and $m_t(w) \le D_{l_t}(w|a_t)$ for all $w$.
Write $L_t = l_0 + \sum_{i=1}^{t-1} m_i$. Suppose that there exists a gauge $g$ and constants
$1 > \eta_1 \ge \eta_2 \ge \ldots > 0$ so that for all $t$ we have

$$\eta_t D_{L_t}(v|a) \ge \frac{1}{2}(g(v - w))^2$$

for all $v$ and $w$ and $a \in \partial L_t(w)$ and

$$D_{l_t^*}(a_t|w) \ge \frac{1}{2}(g^\circ(a_t - a))^2$$

for all $w$ and $a \in \partial l_t(w)$. Then the loss of the MAP algorithm is bounded by

$$\sum_{t=1}^{T} m_t(w_t) \le \sum_{t=1}^{T} \frac{1}{1 - \eta_t} m_t(u) + \frac{1}{1 - \eta_1} D_{l_0}(u|0)$$

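Proof: We have

$$L_t(w) \ge \frac{1}{2\eta_t}(g(w - w_t))^2 + L_t(w_t)$$

$$m_t(w) \ge m_t(w_t) + (w - w_t) \cdot m_t'(w_t)$$

$$L_{t+1}(w) \ge \frac{1}{2\eta_t}(g(w - w_t))^2 + (w - w_t) \cdot m_t'(w_t) + L_t(w_t) + m_t(w_t)$$

$$L_{t+1}^*(x) \le \frac{1}{2\eta_t}(g^\circ(\eta_t(x - m_t'(w_t))))^2 + x \cdot w_t - L_t(w_t) - m_t(w_t)$$

$$L_{t+1}^*(0) - L_t^*(0) \le \frac{1}{2\eta_t}(g^\circ(-\eta_t m_t'(w_t)))^2 - m_t(w_t)$$

$$\le (\eta_t - 1) m_t(w_t)$$

The fourth line above is true because the dual of $a f(w - c) + b \cdot (w - c)$ is
$a f^*((x - b)/a) + x \cdot c$. The fifth is true because $L_t^*(0) = -L_t(w_t)$. The last line
is true because $g^\circ$ is positively homogeneous and
$\frac{1}{2}(g^\circ(-m_t'(w_t)))^2 \le m_t(w_t)$.

The desired result now follows by applying Theorem 3.1 to the loss functions
$l_0, m_1, m_2, \ldots$, taking $m_t$ as the loss to be bounded and $c_t = 1 - \eta_t$. □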

The way it is stated, this theorem bounds the loss in terms of the functions
$m_t$; it is just as easy to give a bound in terms of $l_t$ by substituting
$m_t(w_t) = D_{l_t}(w_t|a_t)$ and $m_t(u) \le D_{l_t}(u|a_t)$.


3.8  GGD  examples 


Perhaps the simplest use of GGD is to approximate the mean of a population of
vectors by looking at a sample $z_1, z_2, \ldots$. This application of GGD corresponds
to the prior loss $l_0(w) = \frac{k}{2}\|w\|_2^2$ and the one-step losses
$l_t(w) = \frac{1}{2}\|w - z_t\|_2^2$. With
these loss functions, GGD will predict $w_{t+1} = w_t + \frac{1}{k}(z_t - w_t)$. We saw above
that we can take $g$ to be Euclidean distance and $\eta = \frac{1}{k}$, so Theorem 3.2 tells us
that our loss is bounded by

where $\bar{z} = \frac{1}{T}\sum_{t=1}^{T} z_t$ is the optimal constant prediction.

The first term on the right-hand side of the above inequality depends on the
training examples $z_t$ only through their variance; the second depends on the
examples only through their mean. So, the inequality tells us that even if the
training examples are chosen by an adversary, as long as they have bounded
mean and variance, we can still achieve bounded average regret per trial. More
specifically, suppose that as $T \to \infty$ the mean of $z_1, \ldots, z_T$ approaches $\mu$ and the
covariance approaches $\sigma^2 I$. Then for large enough $T$ the second term becomes

negligible,  and  our  average  loss  per  trial  will  approach  So,  our  average 

2 

regret  per  trial  will  approach 

By way of comparison, we can compute the asymptotic average case regret
per trial for this variant of GGD: suppose that the training examples $z_t$ are
independent identically distributed random variables that follow a normal
distribution with mean $\mu$ and covariance $\sigma^2 I$. Then the optimal prediction will
approach $\mu$ for sufficiently large $T$, and its expected loss on each trial will approach
$\sigma^2$. On the other hand, by solving the recurrences
$E w_{t+1} = (1 - \eta) E w_t + \eta E z_t$
and $\mathrm{Var}\, w_{t+1} = (1 - \eta)^2\, \mathrm{Var}\, w_t + \eta^2\, \mathrm{Var}\, z_t$, we can see that $E w_t \to \mu$ and
$\mathrm{Var}\, w_t \to \frac{\eta}{2 - \eta}\sigma^2 I$. So, the expected loss per trial of the GGD algorithm
approaches

E\\zt  -  ng  +  Ey  -  wtg  -^^2(1  +  ^ 

z  7]  ^2 

2 

and the average regret per trial approaches $\frac{\eta}{2 - \eta}$ times the expected loss
of the optimal prediction. That means that as $\eta \to 0$
there is a difference of approximately a factor of two between the worst-case
and average-case regret for this algorithm. This gap appears to be necessary:
at least for small learning rates, the sequence $z_1, z_2, \ldots = 1, 1, -1, \ldots$ forces
nearly as much regret as our bound.
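
The limits of the two recurrences can be confirmed by iterating them directly. The sketch below does this for the scalar case; the function name and the particular values of $\eta$, $\mu$ and $\sigma^2$ are hypothetical, and the closed form $\eta\sigma^2/(2 - \eta)$ is simply the fixed point of the variance recurrence.

```python
def stationary_moments(eta, mu, sigma2, steps=2000):
    """Iterate the mean and variance recurrences for w_{t+1} = (1-eta) w_t + eta z_t.

    Assumes scalar z_t with E z_t = mu and Var z_t = sigma2, independent of w_t.
    """
    mean, var = 0.0, 0.0         # w_1 = 0 is deterministic
    for _ in range(steps):
        mean = (1 - eta) * mean + eta * mu
        var = (1 - eta) ** 2 * var + eta ** 2 * sigma2
    return mean, var

eta, mu, sigma2 = 0.1, 3.0, 2.0
mean, var = stationary_moments(eta, mu, sigma2)
# Fixed point of v = (1 - eta)^2 v + eta^2 sigma2.
closed_form = eta * sigma2 / (2 - eta)
```

The ratio `var / sigma2` is the factor $\eta/(2 - \eta)$ by which the expected loss of GGD exceeds that of the optimal constant prediction in this setting.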

For another example, take $l_0$ to be a multiple of the entropy function on
the unit simplex. That is, suppose $l_0(w) = k D_H(w|0)$, with $H$ defined in
Equation 3.4. The resulting update is

$$w_{t,i} = \frac{\exp(-X_{t,i}/k)}{\sum_j \exp(-X_{t,j}/k)}$$

where $X_t = \sum_{i=1}^{t-1} l_i'(w_i)$. This is the Exponentiated Gradient algorithm of [KW97].
(If the loss functions $l_t$ for $t \ge 1$ are linear, it is also the same as the WMC
algorithm.)

If  now  ==  —  zt)"^  for  t  >  1,  we  saw  above  that  we  can  take  rj  = 

So Theorem 3.2 tells us that our loss is bounded by

T  T 

£  Ikt  -  ^\\u-  («|0) 

t=l  ^  Ak  t=l  ^ 

for any $u$. This bound is not the same as any bound in [KW97] or [KW96],
since those papers consider only regression problems; so, we defer a comparison
until Section 3.11 below.


3.9  Inference  in  exponential  families


The MAP algorithm requires solving a minimization problem to find each
prediction $w_t$. If the loss functions are arbitrary, the minimization problem may be
difficult. Suppose, though, that $l_t$ has the same functional form for each $t$, say
$l_t(w) = D_l(w|a_t)$ for some fixed strictly convex function $l$. (By convention we
take $l$ so that $a_0 = 0$.) Then, as we will show shortly, we will always be able to
put the optimization problem into a simple form.

One situation where this kind of prediction problem might arise is when
the vectors $a_t$ are samples from some target distribution. Our goal in this
case is to predict $w_t$ so that $l'(w_t)$ is as close as possible to the center of the
distribution, where centrality is defined by the divergence $D_l$. As we will see
in Section 3.9.2, this definition of centrality is a good one if we are trying to
infer the natural parameter in an exponential family of distributions (hence the
title of this section). Unlike the standard statistical approach, though, we are
making no distributional assumptions about the vectors $a_t$: they need not be
identically distributed, independent, or even random.

In more detail, our optimization problem at step $t$ is to find

$$\arg\min_w \sum_{i=0}^{t-1} l_i(w)$$

Define $L_t = \sum_{i=0}^{t-1} l_i$ so that our problem is to minimize $L_t$. Then the prediction
of the MAP algorithm will be

$$\begin{aligned}
\arg\min_w L_t(w) &= \arg\min_w \sum_{i=0}^{t-1} \left[ l(w) + l^*(a_i) - w \cdot a_i \right] \\
&= \arg\min_w \left[ t\, l(w) - w \cdot \sum_{i=0}^{t-1} a_i \right] \\
&= \arg\min_w \left[ l(w) - w \cdot \frac{1}{t} \sum_{i=0}^{t-1} a_i \right]
\end{aligned}$$

so that the minimizers are exactly the $w \in \partial l^*(\bar{a}_t)$,
where we have defined $\bar{a}_t$ to be the mean of $a_0 \ldots a_{t-1}$. In other words, the MAP
algorithm has a simple implementation: to make our prediction, we compute
the average of all the samples $a_t$ seen so far, then apply $(l^*)'$ to this average.

The implementation is almost the same if we take $l_0 = n_0 D_l(w|a_0)$ for some
multiplier $n_0$. In that case, the prediction is a weighted average of $a_0 \ldots a_{t-1}$ in
which $a_0$ gets $n_0$ times as much weight as any of the other $a_i$s.
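
As an illustration of this implementation, the sketch below takes $l(w) = \ln(1 + \exp(w))$, whose dual gradient $(l^*)'$ is the inverse logistic link of Figure 3.4, and checks that averaging the samples and applying the link does minimize the summed divergences. The sample values and function names are hypothetical, and for simplicity the samples are arbitrary points of $(0, 1)$ rather than following the $a_0 = 0$ convention.

```python
import math

def inverse_logistic(a):
    # (l^*)'(a) for l(w) = ln(1 + exp(w)); requires 0 < a < 1.
    return math.log(a / (1 - a))

def map_prediction(samples):
    """Average the samples seen so far, then apply (l^*)'."""
    a_bar = sum(samples) / len(samples)
    return inverse_logistic(a_bar)

def total_loss_gradient(w, samples):
    """Derivative in w of sum_i [l(w) + l^*(a_i) - w * a_i]."""
    logistic = 1.0 / (1.0 + math.exp(-w))    # l'(w)
    return sum(logistic - a for a in samples)

samples = [0.5, 0.9, 0.2, 0.7, 0.6]          # hypothetical a_0 ... a_{t-1}
w = map_prediction(samples)
residual = total_loss_gradient(w, samples)    # zero at the minimizer
```

Because the summed loss is convex in $w$, a vanishing derivative at `w` confirms that the averaged-then-linked prediction is the MAP prediction for these samples.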


3.9.1  Regret  bounds 

In our current inference problem, $L_t$ and $l_t$ each differ from a multiple of $l$ by
a linear function. So, in order to apply Theorem 3.2, we must show that $l$ is
neither too convex nor too shallow. In other words, we must find a gauge $g$ and
constant $k$ so that

$$D_{kl}(v|a) \ge \frac{1}{2}(g(v - w))^2$$

for all $v$ and $w$ and $a \in k\,\partial l(w)$ and

$$D_{l^*}(a_t|w) \ge \frac{1}{2}(g^\circ(a_t - a))^2$$

for all $w$ and $a \in \partial l(w)$.

Under these assumptions, we can apply Theorem 3.2 with $l_0 = n_0 D_l(w|0)$,
$l_t(w) = m_t(w) = D_l(w|a_t)$ for $t \ge 1$, and $\eta_t = k/(n_0 + t - 1)$. (To make sure that
$\eta_t < 1$ we must take $n_0 > k$.) The result is that

T  T  . 

B/  {wt  |a()  <  Y  1 +  1  (“|0) 

I""?! 

If $l_t(u)$ is bounded for $t$ larger than some $t_0$, then the first term on the right
hand side is $O(t + \ln t)$ as $t \to \infty$. This is the same asymptotic behavior as the
average-case regret, although the constant in front of $\ln t$ will usually be smaller
for average- than for worst-case bounds.

The constant $k$ will be equal to 1 only if $l$ is quadratic. However, if the
predictions $w_t$ remain in some region $W$ for sufficiently large $t$, for those $t$
we can take $g$ and $k$ as bounds on the convexity of $l$ just within $W$ instead
of globally. This trick may result in better asymptotic bounds in some cases.
Even with this trick the bounds may not be very tight: for example, it does not
appear to be possible to prove bounds of the form obtained in [Fre96] using this
strategy.

3.9.2  A  Bayesian  interpretation 

We have just proved worst-case regret bounds for a special case of the MAP
algorithm. Interestingly, we can also justify the same algorithm with an
average-case argument. (For background see [BN78].) Suppose, just as before, that our
loss on step $t$ is $D_l(w_t|a_t)$. Suppose now, though, that each $a_t$ is an
independent sample from some known distribution. To ensure that the loss is
finite, we will require $a_t$ to be in $\mathrm{dom}\, l^*$ w.p. 1.

In particular, suppose that the distribution of $a_t$ has the form

$$\mu(a|\theta) = \exp(\theta \cdot a - \kappa(\theta) - \phi(a))$$

for some parameter vector $\theta$ and fixed functions $\kappa$ and $\phi$. (Such a set of
distributions is called an exponential family, and $\theta$ is called its natural parameter.)
Suppose also that our prior distribution for $\theta$ has the form

$$\nu(\theta|\lambda_0, n_0) = \exp(\lambda_0 \cdot \theta - n_0 \kappa(\theta) - \chi(\lambda_0, n_0))$$

for parameters $\lambda_0$ and $n_0$, where the function $\chi$ is determined by the requirement
that the density must integrate to 1. (This distribution is called the conjugate
prior for $\mu$, and it is also an exponential family.) We will see below that
choosing $\kappa = l$ has an intuitive interpretation, but other choices of $\kappa$ may also be
reasonable.

Then the log posterior likelihood after seeing $t$ samples will be

$$\theta \cdot \lambda - n\kappa(\theta)$$

where $\lambda = \lambda_0 + \sum_{i=1}^{t} a_i$ and $n = n_0 + t$. We can find the posterior distribution
for $\theta$ by normalizing the posterior likelihood so it integrates to 1. In fact, by
the definition of $\chi$, the normalization factor is $\exp(-\chi(\lambda, n))$. So, the posterior
for $\theta$ is $\nu(\theta|\lambda, n)$.

Notice that the posterior distribution of $\theta$ depends on the observed samples
$a_t$ only through $\sum_{i=1}^{t} a_i$. This sum is called a sufficient statistic for inference
about $\theta$, since once we know it we need no other information about the $a_t$s to
compute the posterior distribution for $\theta$.

Now that we have the posterior distribution for $\theta$, we can compute the best
prediction $w$. First suppose that we knew $\theta$ exactly. Then the expected loss on
each step would be

$$E_\theta(l(w) + l^*(a) - w \cdot a)$$

where we have written $E_\theta$ as shorthand for $E(\cdot \mid a \sim \mu(a|\theta))$. Since we don't
know $\theta$ exactly, we must take the expectation of the above expression under our
posterior distribution for $\theta$. That yields an expected loss of

$$l(w) + E_{\lambda,n}(E_\theta(l^*(a))) - w \cdot E_{\lambda,n}(E_\theta(a))$$

Since $l$ is convex, we can find the $w$ which minimizes expected loss by
differentiating and setting to zero:

$$0 \in \partial l(w) - E_{\lambda,n}(E_\theta(a))$$

So, we can pick any $w$ in $\partial l^*(E_{\lambda,n}(E_\theta(a)))$.

Technically, we need to worry that there might be no $w$ that achieves the
minimum. In that case the above equation would have no solution. But our
reasoning below will provide conditions which guarantee that the expected value
is always in $\mathrm{int\,dom}\, l^*$. So, under those conditions a solution must exist.

Under some regularity conditions on $\mu$, we can compute the expected value
of $a$. First, we can prove by differentiating the identity $\int \mu(a|\theta)\, da = 1$ that

$$E_\theta(a) = \kappa'(\theta)$$

(see for example equation 2.2 (i) of [DY79]). For this reason, $\kappa'(\theta)$ is called
the expectation parameter of the distribution $\mu$. Next, by applying Theorem 2
of [DY79], we find that

$$E_{\lambda,n}(\kappa'(\theta)) = \frac{\lambda}{n}$$

Link name               Distributions
Identity                Normal
Logistic                Beta, Binomial
Inverse logistic        Binomial, Beta
Exponential             Poisson, Gamma
Logarithmic             Gamma, Poisson
Normalized exponential  Dirichlet, Multinomial

Figure 3.5: Links and their associated distributions.


So, as long as $\lambda/n$ is in $\mathrm{int\,dom}\, l^*$, there will be a legal prediction $w$. But $\lambda/n$
will be in $\mathrm{int\,dom}\, l^*$ as long as $\lambda_0/n_0$ is, since it's the average of a bunch of
quantities in $\mathrm{dom}\, l^*$ at least one of which is in $\mathrm{int\,dom}\, l^*$.

But now we have arrived back at our original algorithm: to find the
prediction just average together $a_0, \ldots, a_{t-1}$ and then apply $(l^*)'$. Interestingly,
this conclusion doesn't depend on which exponential family we choose as the
distribution for $a_t$. Instead, any exponential family which is contained in $\mathrm{dom}\, l^*$
results in the same optimal prediction. However, if we choose the exponential
family so that $\kappa = l$, then we can interpret $w$ as the inferred value of the natural
parameter.

The mapping $(l^*)'$ which takes us from the observed average to the natural
parameter is called a link function, just as it was for generalized gradient descent.
Figure 3.4 above shows some useful link functions. Figure 3.5 shows which link
functions correspond to which exponential families if we choose $\kappa = l$.
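
The Bayesian bookkeeping described in this section amounts to a few additions. The sketch below is a hypothetical illustration for binary samples with the inverse logistic link (the Binomial and Beta row of Figure 3.5): it updates the conjugate-prior parameters $\lambda$ and $n$, forms the posterior expectation $\lambda/n$, and applies $(l^*)'$. The prior values and the sample list are made up for the example.

```python
import math

def posterior_params(lam0, n0, samples):
    """Conjugate update: lambda = lambda_0 + sum of samples, n = n_0 + t."""
    return lam0 + sum(samples), n0 + len(samples)

def predict_natural_parameter(lam, n):
    """Apply the inverse logistic link (l^*)' to the expected sample lambda / n."""
    a_hat = lam / n              # must lie strictly between 0 and 1
    return math.log(a_hat / (1 - a_hat))

# Hypothetical prior worth n_0 = 2 pseudo-samples with mean lambda_0 / n_0 = 0.5.
lam, n = posterior_params(1.0, 2.0, [1, 0, 1, 1])
w = predict_natural_parameter(lam, n)
```

Here the prior acts exactly like $n_0$ extra samples with mean $\lambda_0/n_0$, which is the weighted-average interpretation given at the end of the previous derivation.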


3.10  Regression  problems 

A common type of prediction problem is generalized linear regression [MN83,
LW92, KW97], which includes linear regression, logistic regression, other
generalized linear models, perceptron learning, and many other problems. In
generalized linear regression, on each time step $t$ we must predict a vector of regression
coefficients $w_t$. We are then given an input vector $x_t$, from which we form a
prediction $\hat{y}_t = f(x_t \cdot w_t)$. The monotone function $f$ is called the prediction link
function, since it provides a link between the coefficients $w_t$ and the prediction
$\hat{y}_t$. Finally, we find out the desired output $y_t$ and receive a loss $l_t(w) = l(\hat{y}_t, y_t)$.
Regression problems are a special case of our general prediction problem, since
they differ only in that we have specified a particular form for the loss function $l_t$:
for example, the loss functions for linear regression are of the form $(y_t - w \cdot x_t)^2$.

We should not confuse the prediction link function, which is a mapping from
$\mathbb{R}$ to $\mathbb{R}$ that connects $w \cdot x_t$ with the prediction $\hat{y}$, with the link function described
earlier, which is a function from $\mathbb{R}^n$ to $\mathbb{R}^n$ that connects the natural parameters
with the expectation parameters. In designing an algorithm, we can choose the
two kinds of link functions separately. When there is a danger of confusion,
we will call the latter the parameter link function. All of the one-dimensional
parameter link functions in Figure 3.4 are also possible choices for the prediction
link function; Figure 3.6 shows some additional possible choices.

Name           Link function f(p)           Corresponding F(p)
Step           -1     if p < 0              |p|
                1     if p > 0
ε-insensitive  -1     if p < -ε             max(|p| - ε, 0)
                0     if -ε <= p <= ε
                1     if p > ε
Huber          -1     if p < -ε             -p - ε/2    if p < -ε
                p/ε   if -ε <= p <= ε       p^2/(2ε)    if -ε <= p <= ε
                1     if p > ε              p - ε/2     if p > ε

Figure 3.6: Some examples of prediction link functions.
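
The pairs in Figure 3.6 are related by $f = F'$ wherever $F$ is differentiable. The sketch below encodes the three links and their corresponding $F$ and compares each link with a central-difference derivative of its $F$ away from the kinks. The function names, the value of `eps`, and the choice of returning $1$ for the step link at $p = 0$ (where the figure leaves it unspecified) are assumptions made for the illustration.

```python
def step_link(p):
    return -1.0 if p < 0 else 1.0        # value at p == 0 chosen arbitrarily

def eps_insensitive_link(p, eps):
    if p < -eps:
        return -1.0
    return 1.0 if p > eps else 0.0

def huber_link(p, eps):
    if p < -eps:
        return -1.0
    return 1.0 if p > eps else p / eps

def F_step(p):
    return abs(p)

def F_eps_insensitive(p, eps):
    return max(abs(p) - eps, 0.0)

def F_huber(p, eps):
    if p < -eps:
        return -p - eps / 2
    if p > eps:
        return p - eps / 2
    return p * p / (2 * eps)

def numeric_derivative(F, p, h=1e-6):
    # Central difference; only meaningful away from the kinks of F.
    return (F(p + h) - F(p - h)) / (2 * h)

eps = 0.5
points = [-2.0, -0.3, 0.3, 2.0]          # none of these is a kink for eps = 0.5
max_gap = max(
    max(abs(step_link(p) - numeric_derivative(F_step, p)),
        abs(eps_insensitive_link(p, eps)
            - numeric_derivative(lambda q: F_eps_insensitive(q, eps), p)),
        abs(huber_link(p, eps)
            - numeric_derivative(lambda q: F_huber(q, eps), p)))
    for p in points
)
```

A small `max_gap` indicates that each tabulated link agrees with the derivative of its $F$ at the sampled points.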

3.10.1  Matching  loss  functions 

In order to apply our theory, we need the one-step losses $l_t(w)$ to be convex.
This is a condition on the relationship between the prediction link function $f$
and the loss function $l$. It turns out that, given a monotone link function,
we can always define a matching loss function so that $l_t(w)$ is convex. If $f$ is
invertible, we follow [AHW96] and define its matching loss function to be

$$l(\hat{y}, y) = D_{F^*}(y|\hat{y})$$

where $F$ is any convex function with $f = F'$.

If $f$ is not invertible (that is, if $F$ has a linear segment, so that $F^*$ has
a corner) then the above definition no longer works. Intuitively, the problem
is that our predictions get "stuck" as they cross the corner in $F^*$: there is a
whole range of $p$ with the same $f(p)$ and therefore the same loss, producing an
extraneous flat spot in $l_t$.

We can fix the problem by allowing $l_t(w)$ to depend on $p_t = x_t \cdot w$
directly, rather than just on $f(p_t)$. More specifically, we generalize the definition
of [AHW96] and set

$$m(p, y) = D_{F^*}(y|p) = D_F(p|y) = F(p) + F^*(y) - y \cdot p$$

With this definition, it is easy to see that $m(p, y)$ is convex as a function of $p$,
so $l_t(w) = m(x_t \cdot w, y_t)$ is convex in $w$. Intuitively, what we have done is allow
ourselves to specify not just which $y$ will give us zero loss, but also what the
derivative of $F^*$ is at that point. When $F^*$ is smooth, there is only one possible
choice of derivative for each prediction, so we have not changed anything; but
when our prediction is at a corner of $F^*$ we can choose from a range of possible
derivatives.
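
For a concrete instance of the definition $m(p, y) = F(p) + F^*(y) - y \cdot p$, the sketch below uses the logistic prediction link, so that $F(p) = \ln(1 + \exp(p))$ and $F^*(y) = y \ln y + (1 - y)\ln(1 - y)$ as in Figure 3.4. The function names are hypothetical; the checks confirm that the matching loss is the familiar log loss when $y = 1$, that it vanishes when $y = f(p)$, and that its derivative in $p$ is the prediction error $f(p) - y$.

```python
import math

def F(p):
    # Convex function whose derivative is the logistic link.
    return math.log1p(math.exp(p))

def F_star(y):
    # Dual of F on [0, 1], using the convention 0 * ln 0 = 0.
    def xlogx(v):
        return 0.0 if v == 0 else v * math.log(v)
    return xlogx(y) + xlogx(1 - y)

def f(p):
    # Logistic prediction link, f = F'.
    return 1.0 / (1.0 + math.exp(-p))

def matching_loss(p, y):
    """m(p, y) = F(p) + F^*(y) - y * p."""
    return F(p) + F_star(y) - y * p

p, y = 0.7, 0.25
h = 1e-6
# Central-difference derivative of m in p, to compare with f(p) - y.
slope = (matching_loss(p + h, y) - matching_loss(p - h, y)) / (2 * h)
```

For this link the matching loss is the log loss, which is why logistic regression trained with log loss yields a convex problem in $w$.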

We will use the derivative of the loss function below. It turns out that the
prediction error is a derivative of $m$ with respect to $p$:

$$f(p) - y \in \partial F(p) - y = \partial_p m(p, y)$$

So, a derivative of $l_t(w)$ is $(f(x_t \cdot w) - y_t)\, x_t$.

3.10.2  Regret  bounds

In  order  to  bound  the  regret  of  the  MAP  algorithm  for  regression  problems,  we 
need  to  find  a  gauge  g  so  that  We  have  already  done 

so for the special case of the identity link with squared loss: in Section 3.7.2,
we showed that the allowable choices for $g$ are the symmetric gauges such that
$g^\circ(x_t) \le 1$. (Symmetric gauges are also called seminorms.)

The situation is similar for general link functions and their matching loss
functions. In this case, though, we must make one additional assumption: we
must bound how quickly the prediction $\hat{y}_t$ changes when we change the raw
prediction $p_t$.

So,  we  will  assume  that  Bpiply)  >  ^{y  ^  f{p))^  foi'  some  A  >  0.  (This  is 
essentially  a  Lipschitz  condition  on  /.)  With  this  assumption,  we  can  write 

lt{w)  =  I^F{xfw\yt) 

But  we  saw  above  that  Vf.{w)  =  {f{xt  •  w)  —  yt)xt.  So,  if  p  is  a  symmetric  gauge 
such  that  Xg°{xt)  <  1,  then 

(yt  -  f{xt  ■  w)f  > 
hiw)  > 

But  now  we  have  proven 

Theorem  3.3  Let  F  be  a  closed  convex  function  with  Pf(p|p)  >  ^{y  —  f{p))^> 
Suppose  that  the  functions  ^  ure  of  the  form  lt{w)  =  3F{yt\w  •  xt)  for 

given  vectors  xt  and  scalars  yt*  Pick  a  prior  loss  Iq  and  functions  mi,  m2, . . 
and  suppose  that  /o,7ni,m2, . . .  satisfy  the  conditions  of  Lemma  3.1,  so  that  the 
MAP  algorithm  applied  to  these  loss  functions  always  produces  a  prediction  wt 
at  each  trial  Suppose  that  for  all  t,  dmt{wt)  is  nonempty,  mf{wt)  =  und 

<  hM  for  all  w.  Write  =  /q  +  symmetric  gauge  g 

be  so  that  Xg^{xt)  <  1  for  all  t.  Finally,  let  the  constants  1  >  >  772  >  •  •  •  >  0 

be  such  that 

m^Ltiv\a)  >  ^(giv-w))^ 

for  all  V  and  w  and  a  €  dLt{w).  Then 

E  ^  E  («|o) 
for all $u$.


Proof;  Apply  Theorem  3.2  to  the  functions  lo,li, . . .  with  at  —  0, 

using  the  gauge  g  and  the  learning  rates  rft-  □ 

While the size of the input vectors $x_t$ doesn’t appear explicitly in this bound,
it affects the choice of $g$ and therefore the allowed values of $\eta_t$. For example,
depending on the size of the input vectors, we might need to set $g(w)$ to either
$\|w\|_1$ or $10\|w\|_1$. At the cost of introducing an extra parameter, we could have
written the theorem to allow us to set $g(w) = \|w\|_1$ no matter the scale of the
input vectors. For examples of the application of this theorem, see Section 3.11.

3.10.3 Multidimensional outputs

So far in our regression problems we have assumed that the target $y_t$ is
one-dimensional. Our proofs work equally well, though, if $y_t$ is selected from an
arbitrary vector space $Y$. In that case, the parameter matrix $w_t$ will be a
linear mapping that takes $x_t$ to $p_t \in Y$. The prediction link function $f$ will be
the derivative of some convex function $F$ on $Y$, and the matching loss will be
$D_F(p_t \mid y_t)$, so the derivative of $l_t(w)$ will be $(f(w x_t) - y_t) x_t^T$.

The only part of the proof that requires some modification is the definition
of the gauge $g$. Since $w$ is a matrix and $x_t$ and $y_t$ are vectors of possibly different
lengths, we need different gauges to measure the size of each one. (Previously
we had used $g$ for $w$, $g^\circ$ for $x_t$, and $|\cdot|$ for $y_t$.) So, we will assume that we
have symmetric gauges $r$ and $s$ so that $r^\circ(x_t) \le 1$ and $D_F(p \mid y) \ge \frac{1}{2} s(y - f(p))^2$.
Then we will define $g$ by the relation

$$g(w) = \sup_{\{u \mid r(u) \le 1\}} s(wu)$$

This $g$ is called the matrix gauge for $r$ and $s$. Since $r$ and $s$ are symmetric, so
is $g$. Also, if $x$ and $y$ are vectors with $r^\circ(x) = 1$ and $s(y) = 1$, then $g(yx^T) = 1$,
since $s(yx^T u) = s(y)\, x^T u \le s(y)\, r^\circ(x)\, r(u)$, with equality for an appropriately
chosen $u$.

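With this choice of $g$, we have

$$\begin{aligned}
\tfrac{1}{2}\, g(l_t'(w))^2 &= \tfrac{1}{2}\, g((f(w x_t) - y_t) x_t^T)^2 \\
&= r^\circ(x_t)^2 \, \tfrac{1}{2}\, s(f(w x_t) - y_t)^2 \\
&\le D_F(w x_t \mid y_t) \\
&= l_t(w)
\end{aligned}$$

so Theorem 3.3 applies with $\lambda = 1$. (To achieve the effect of varying $\lambda$ we can
simply rescale the gauge $s$.)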

For example, if $f$ is the identity prediction link (so that $F(y) = \frac{1}{2} y^T y$ and
we can take $s$ to be Euclidean distance) and $\|x_t\|_2 \le 1$ (so that we can take $r$
to be Euclidean distance also), then $g(w)$ will be the matrix 2-norm $\|w\|_2$. If
we  now  take  lo{w)  =  |  iw?-,  then  ^Diq{v\w)  >  lg{v  -  w)^,  so  we  can  take

Vi  =  f 

Often we will take each coordinate of $f$ to be one of the one-dimensional link
functions described above. This kind of link function decomposes the
multiple-output prediction problem into several single-output problems which share a
parameter vector. On the other hand, sometimes we may know about dependencies
among the components of the output vector. In this case we can take
advantage of our knowledge by picking a prediction link function that encodes
these dependencies. For example, if we have reason to believe that the output
vector has covariance matrix $\Sigma$, we can select the link $y = \Sigma p$ with its matching
loss $\frac{1}{2} p^T \Sigma p - y^T p + \frac{1}{2} y^T \Sigma^{-1} y$.

3.11  Linear  regression  algorithms 

In this section we will analyze several gradient-descent-like algorithms for
linear regression: standard gradient descent and two exponentiated gradient
algorithms from [KW97] called EG and EG±. These algorithms are all generalized
linear regression algorithms, and therefore MAP algorithms.

In  linear  regression  problems,  the  loss  on  trial  t  >  1  is  lt{w)  =  3^  (iy|0)  = 
This  is  the  loss  function  for  a  generalized  linear  regression  model 
using  the  identity  prediction  link  with  its  matching  loss  function,  the  squared 
error. The algorithms differ only in their choice of prior loss $l_0$.

We will bound the regret of each algorithm by applying Theorem 3.3. Because
$l_t$ for $t \ge 1$ always has the form given above, we can take $\lambda = 1$ in
Theorem 3.3; so the main part of the analysis of each algorithm will be to find
appropriate seminorms $g$ and $g^\circ$ with which to measure the parameter vectors
$w_t$ and the input vectors $x_t$.

First consider the gradient descent algorithm for linear regression, defined
by the update

$$w_{t+1} = w_t + \eta (y_t - w_t \cdot x_t) x_t$$

Gradient descent is a GGD algorithm, given by the choice $l_0(w) = \frac{1}{2\eta}\|w\|_2^2$ (or
$l_0 = \frac{1}{2\eta}\|w - w_1\|_2^2$ if we want a starting weight vector $w_1 \ne 0$). We showed above
that if $\|x_t\|_2 \le X$ for all $t$ then we can take $g^\circ(x) = \frac{1}{X}\|x\|_2$ and $\eta_t = X^2\eta$ in
Theorem 3.3. The result is that

-  1  -  X^T]  ^  2r](l  -  X^) 

Next  consider  the  exponentiated  gradient  algorithm.  EG  is  a  GGD  algorithm 
given  by  the  choice  lo{w)  =  so  its  update  is 


$$v_{t+1,i} = w_{t,i} \exp(\eta (y_t - w_t \cdot x_t) x_{t,i})$$

$$w_{t+1,i} = \frac{v_{t+1,i}}{\sum_j v_{t+1,j}}$$
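
The gradient descent and exponentiated gradient updates can be sketched side by side as follows. This is a minimal illustration rather than the thesis's implementation: the function names are hypothetical, the learning rate `eta` is passed in directly, and the EG step is assumed to renormalize the weights so that they sum to one.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def gd_update(w, x, y, eta):
    # Additive update: w <- w + eta * (y - w.x) * x
    err = y - dot(w, x)
    return [wi + eta * err * xi for wi, xi in zip(w, x)]

def eg_update(w, x, y, eta):
    # Multiplicative update followed by renormalization onto the unit simplex
    # (the renormalization is the assumed second half of the EG step).
    err = y - dot(w, x)
    v = [wi * math.exp(eta * err * xi) for wi, xi in zip(w, x)]
    total = sum(v)
    return [vi / total for vi in v]

print(gd_update([0.0, 0.0], [1.0, 2.0], 1.0, 0.1))
print(eg_update([0.5, 0.5], [1.0, 0.0], 1.0, 1.0))
```

Both updates move the weights in the direction that reduces the prediction error on the current example; they differ in whether the step is applied to the weights themselves or to their logarithms.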

To analyze EG, we will set $X$ to be the maximum span of any of the input
vectors, that is, $\|x_t\|_{sp} \le X$ where

$$\|x\|_{sp} = \max_i x_i - \min_i x_i$$

It is easy to check that $\|\cdot\|_{sp}$ is a seminorm. We can bound the polar of
the span seminorm by splitting its argument vector into components parallel
and perpendicular to $e = (1, 1, \ldots, 1)^T$. We have $\|e\|_{sp} = 0$, so $\|e\|_{sp}^\circ = \infty$.
On the other hand, if $x$ has no component along $e$, then $\|x\|_{sp} \ge \|x\|_\infty$, so
Iklisp  <  l|a;||i  <  lia;||2.  That  means  that,  for  any  v  and  w  and  a  €  dH{w), 


DHHa)>2(||u-u;|||p)2 

To  see  why,  remember  that  by  assumption  dH{w)  is  nonempty,  so  w  must  be  in 
the  unit  simplex.  So,  depending  on  whether  v  is  in  the  plane  containing  the  unit 
simplex,  either  v  —  w  has  a  nonzero  component  along  e,  in  which  case  HDjj  (u|a) 
is  infinite,  or  u  -  ru  is  perpendicular  to  e,  in  which  case  D/f  >  2||u  -  w|||.  In 
either  case  the  result  follows.  So,  we  can  take  g°(x)  =  ^||a;i|sp  and  r]t  =  \X‘^r) 
in  Theorem  3.3  and  conclude  that 


gW”')  s  rrps;  EW“)  + 


The above results can be compared to Lemmas 5.2 (for GD) and 5.9
(for  EG)  in  [KW97].  Unfortunately,  our  bounds  here  are  slightly  weaker  than 
the  ones  in  [KW97].  We  do  not  believe  that  this  is  due  to  a  weakness  in  our 
framework;  instead  we  believe  that  with  some  additional  work  our  theorems 
could  be  sharpened  so  that  they  are  a  strict  generalization  of  the  known  results 
for  linear  regression  with  GD  and  EG. 

After deriving the results mentioned above, the authors of [KW97] perform
an additional step: they adjust the learning rate $\eta$ so that the two terms in the
regret bound have comparable coefficients. We have not taken this step.

Finally consider the EG± algorithm. Just as in [KW97], we could prove
bounds on EG± by reducing it to EG. Instead, we will sketch how to find the
prior $l_0$ that yields the EG± algorithm. Finding this prior is important both
because it increases our understanding of EG± and because it is a good first
step towards a direct proof of the regret bound for EG±.

The EG± algorithm can be defined by its parameter link function, which is
(up to scaling) given by the mapping $w = f(x)$ defined as

$$w_i = \frac{\exp(x_i) - \exp(-x_i)}{\sum_j (\exp(x_j) + \exp(-x_j))}$$

The prior loss function $l_0$ for EG±, and its convex dual $l_0^*$, are determined up
to scaling by this link function. We can find $l_0^*(x)$ by integrating $f$ along any
path from the origin to $x$.

To perform the integral, we will choose a path with $n$ axis-parallel segments:
one which increases the first coordinate of $x$ from 0 to its final value $x_1$, then
another which increases the second coordinate from 0 to its final value $x_2$, and
so forth. The integral along the $i$th segment (which varies the $i$th coordinate of
$x$) is

$$\int_0^{x_i} \frac{\exp(t) - \exp(-t)}{\exp(t) + \exp(-t) + k_i}\, dt$$

where the constant

$$k_i = 2(n - i) + \sum_{j=1}^{i-1} (\exp(x_j) + \exp(-x_j))$$

is determined by the (constant) values of the other $n - 1$ coordinates of $x$ along
the $i$th segment: the coordinates before $i$ have already reached their final values,
while each coordinate after $i$ is still 0 and so contributes $\exp(0) + \exp(-0) = 2$.
The result of this integral is

$$-x_i - \ln(2 + k_i) + \ln(1 + \exp(2x_i) + k_i \exp(x_i))$$

Summing this expression over all $n$ path segments gives

$$l_0^*(x) = \sum_{i=1}^{n} \left( -x_i - \ln(2 + k_i) + \ln(1 + \exp(2x_i) + k_i \exp(x_i)) \right)$$

For example, if $n = 2$,

$$\begin{aligned}
l_0^*(x) = {} & -x_1 - x_2 - \ln 4 + \ln(1 + \exp(2x_2) + (\exp(x_1) + \exp(-x_1)) \exp(x_2)) \\
& - \ln(2 + \exp(x_1) + \exp(-x_1)) + \ln(1 + 2\exp(x_1) + \exp(2x_1))
\end{aligned}$$

A plot of this function is in the left panel of Figure 3.7; it looks like a
rounded-off version of the $L_\infty$ norm. The right panel of Figure 3.7 shows a plot of the
three-dimensional version of $l_0^*$, made by holding one argument constant at 7
while varying the other two in $[-10, 10]$. In other words, we have plotted $l_0^*$ on
a two-dimensional slice of $\mathbb{R}^3$. This plot looks like a rounded-off version of the
same slice of the $L_\infty$ norm on $\mathbb{R}^3$. The characterization of $l_0^*$ as a rounded-off
version of the $L_\infty$ norm makes sense, since EG± restricts $w_t$ to have bounded
$L_1$ norm and the dual of $\delta(x \mid \|x\|_1 \le 1)$ is the $L_\infty$ norm.
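
The path integral for the EG± dual potential can be checked numerically. The sketch below makes two assumptions of ours: coordinates not yet visited by the path sit at zero, so each contributes 2 to $k_i$; and, as a consequence, the per-segment terms telescope to $\ln \sum_j (\exp(x_j) + \exp(-x_j)) - \ln 2n$. It then confirms that a finite-difference gradient of the potential reproduces the link function. All names are hypothetical.

```python
import math

def cosh2(t):
    # exp(t) + exp(-t)
    return math.exp(t) + math.exp(-t)

def link(x):
    # Parameter link: w_i = (e^{x_i} - e^{-x_i}) / sum_j (e^{x_j} + e^{-x_j})
    z = sum(cosh2(xj) for xj in x)
    return [(math.exp(xi) - math.exp(-xi)) / z for xi in x]

def potential_by_segments(x):
    # Sum of the per-segment integrals along the axis-parallel path.
    n = len(x)
    total = 0.0
    for i, xi in enumerate(x):  # i is 0-based here
        # Earlier coordinates are at their final values; later ones are still 0.
        k = sum(cosh2(xj) for xj in x[:i]) + 2 * (n - 1 - i)
        total += -xi - math.log(2 + k) + math.log(1 + math.exp(2 * xi) + k * math.exp(xi))
    return total

def potential_closed_form(x):
    # Telescoped form of the same sum.
    return math.log(sum(cosh2(xj) for xj in x)) - math.log(2 * len(x))

def numeric_gradient(fn, x, h=1e-6):
    grad = []
    for i in range(len(x)):
        up = list(x); up[i] += h
        dn = list(x); dn[i] -= h
        grad.append((fn(up) - fn(dn)) / (2 * h))
    return grad

x = [0.3, -1.2, 2.0]
print(potential_by_segments(x), potential_closed_form(x))
print(numeric_gradient(potential_closed_form, x))
print(link(x))
```

The link outputs always have $L_1$ norm below 1, which matches the remark that EG± keeps $w_t$ within a bounded $L_1$ ball.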

3.12  Discussion 

We  have  presented  a  unified  framework  for  deriving  worst-case  regret  bounds  for 
a  wide  class  of  learning  algorithms.  These  algorithms  include  weighted  majority; 
gradient  descent  and  generalizations  of  gradient  descent  such  as  exponentiated 
gradient;  linear  and  logistic  regression;  and  inference  of  the  natural  parameter  in 
an  exponential  family.  Because  we  have  wherever  possible  avoided  assumptions 
such  as  differentiability  of  the  loss  functions,  our  framework  also  includes  a  wide 
variety  of  new  algorithms  which  we  have  not  fully  explored. 

Figure 3.7: The dual of the potential function for EG±.

Our unified treatment sheds light on the relationships among these methods,
and it provides a recipe for designing and studying new learning algorithms. For
example, we showed that both the gradient descent and exponentiated gradient
algorithms for linear regression are MAP algorithms. By casting them in
this common framework, we revealed that the only difference between these two
algorithms is their choice of prior loss function. In addition to allowing a common
proof of the regret bounds for these algorithms, this analysis suggests that
we can design new linear regression algorithms simply by picking new priors.
These priors can express known bounds on the parameter vector (for example,
the prior $\frac{1}{2} w^2 + \delta(w \mid C)$ yields the gradient projection algorithm with domain $C$)
or preferences for particular kinds of parameter vectors (for example, the prior
of the EG± algorithm prefers vectors with low $L_1$ norm).

Our  results  also  suggest  new  applications  for  old  algorithms.  By  avoiding 
assumptions  such  as  independence  of  training  examples,  we  have  justified  the 
use  of  these  algorithms  in  situations  where  they  might  not  have  been  considered 
before. 
Chapter 4


CONVEX ANALYSIS
AND MDPS

In this chapter we will apply the ideas of convex analysis and statistical inference
to the problem of approximating the value function of a Markov decision
process.  In  Sections  4.1  through  4.4,  we  will  show  how  to  represent  an  MDP  as 
a  convex  program.  This  transformation  will  allow  us  to  apply  the  well-known 
theory  of  convex  programming  to  the  problem  of  finding  its  value  function.  In 
Section 4.5 we will show how to represent an MDP as either a maximum likelihood
or a maximum entropy problem. This transformation will allow us to
apply the well-known theory of statistical inference to the problem of finding
its value function. In Section 4.6, we will describe several ways we have tried to
introduce approximations of the value function into these two representations
of an MDP. In Section 4.7 we will describe an implementation of one of these
algorithms.  Finally,  in  Section  4.8,  we  will  describe  some  experiments  we  have 
done  with  this  implementation.  While  the  experiments  show  that  this  particular 
algorithm  does  not  improve  on  the  best  existing  ones,  we  hope  that  the  ideas  of 
this  chapter  can  be  incorporated  into  other  algorithms. 


4.1  The  Bellman  equations 

We  saw  in  Chapter  1  that  the  value  function  for  an  MDP  is  the  unique  solution 
to  the  Bellman  equations 

$$v(x) = \min_a E\bigl(c(x, a) + \gamma\, v(\delta(x, a))\bigr)$$

(base  cases  such  as  v{(D)  =  0  may  be  necessary  if  7  =  1).  As  pointed  out  in  for 
example [Ros83, p40] or [Ber76, p248], we can rewrite the Bellman equations as
a  linear  program  by  noticing: 

• If there’s a deterministic action $a$ that takes the agent from state $x$ to
state $y$ with cost $c(x, a)$, then $v(x) \le \gamma v(y) + c(x, a)$.

• Similarly, if there’s a stochastic action $a$ that takes the agent from state
$x$ to a probability distribution $p$ over the state space, then $v(x) \le \gamma\, p \cdot v + c(x, a)$.
The notation $p \cdot v$ means $\sum_y p_y v(y)$; in other words, $p \cdot v$ is the
expectation of the value of the next state.

• The value function is the pointwise largest function $v$ satisfying these
constraints along with any base cases.

The  resulting  linear  program  is 

maximize $s^T v$

subject to $Ev + c \ge 0$

where  E  is  the  edge  adjacency  matrix  for  our  MDP  (defined  in  Chapter  1),  c 
is  the  cost  vector,  and  s  is  any  vector  with  all  components  positive.  For  this 
section  we  will  assume  that  s  has  all  components  equal  to  1;  in  Section  4.2.3  we 
will  attach  a  meaning  to  the  choice  of  objective  vector. 




maximize $x + y + z + g$ subject to

$$\begin{aligned}
-x + y + 1 &\ge 0 \\
-x + z + 2 &\ge 0 \\
-y + g + 1 &\ge 0 \\
-z + g + 1 &\ge 0 \\
-g &\ge 0
\end{aligned}$$

(b)


Figure  4.1:  How  to  turn  an  MDP  into  an  LP. 


Figure 4.1 shows an example of translating a simple MDP to a linear program.
(To avoid clutter we have adopted the shorthand of writing just $x$ instead
of $v(x)$ to mean the value of state $x$.) This MDP happens to be undiscounted and
deterministic, but the translation works just as well for discounted or stochastic
MDPs. There is one constraint in the program (that is, one row of $E$) for
each edge or state-action pair in the MDP. There is one variable in the program
(that is, one column of $E$) for each state in the MDP. For example, the row
$-x + y + 1 \ge 0$ corresponds to the edge from state $x$ to state $y$ with cost 1. If
there were a unit-cost action that moved the agent from state $x$ to state $y$ with
probability .7, and from state $x$ to state $z$ with probability .3, the corresponding
constraint would be $-x + .7y + .3z + 1 \ge 0$.

The optimal solution to this MDP is $(x, y, z, g) = (2, 1, 1, 0)$. In linear
programming terminology, the elements of the vector $Ev + c = (0, 1, 0, 0, 0)^T$ are
called slacks; in dynamic programming terminology, they are called advantages
or  Bellman  residuals.  In  either  case,  the  edges  in  the  optimal  policy  are  the 
ones  whose  slack  is  0.  That  means  that  an  optimal  policy  for  the  MDP  is  the 
same  as  an  optimal  basis  for  the  linear  program.  (This  is  a  consequence  of  the 
property  called  complementary  slackness.) 
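
The translation from MDP to linear program can be sketched in a few lines. The code below is an illustration under our own assumptions: it encodes the four-state example as a list of deterministic edges, builds $E$ and $c$ with one row per edge plus a base-case row for the goal, and uses plain value iteration (not an LP solver) to obtain $v$ before computing the slacks $Ev + c$. The function names and data layout are hypothetical.

```python
STATES = ["x", "y", "z", "g"]
# Deterministic edges of the example MDP: (from_state, to_state, cost).
EDGES = [("x", "y", 1.0), ("x", "z", 2.0), ("y", "g", 1.0), ("z", "g", 1.0)]

def build_lp(states, edges, goal, gamma=1.0):
    # One row of E per edge: -v(src) + gamma * v(dst) + cost >= 0.
    col = {s: i for i, s in enumerate(states)}
    E, c = [], []
    for src, dst, cost in edges:
        row = [0.0] * len(states)
        row[col[src]] -= 1.0
        row[col[dst]] += gamma
        E.append(row)
        c.append(cost)
    # Base case for the goal state: -v(goal) >= 0.
    base = [0.0] * len(states)
    base[col[goal]] = -1.0
    E.append(base)
    c.append(0.0)
    return E, c

def value_iteration(states, edges, goal, gamma=1.0, sweeps=50):
    # Repeatedly apply the Bellman backup; the goal value stays at 0.
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            if s != goal:
                v[s] = min(cost + gamma * v[dst]
                           for src, dst, cost in edges if src == s)
    return [v[s] for s in states]

def slacks(E, c, v):
    # Elementwise Ev + c.
    return [sum(e * vi for e, vi in zip(row, v)) + ci for row, ci in zip(E, c)]

E, c = build_lp(STATES, EDGES, goal="g")
v = value_iteration(STATES, EDGES, goal="g")
print(v)
print(slacks(E, c, v))
```

The edges whose slack is zero are exactly the ones an optimal policy may use; here only the edge from $x$ to $z$ has positive slack.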


4.2  The  dual  of  the  Bellman  equations 

4.2.1  Linear  programming  duality 

Every  linear  program  can  be  paired  with  another  linear  program  called  its  dual. 
The  original  (or  primal)  and  dual  programs  are  different  views  of  the  same 
problem:  the  optimal  values  of  their  objective  functions  are  the  same,  and 
knowing  a  solution  to  one  makes  it  much  easier  to  find  a  solution  to  the  other. 

We  can  derive  linear  programming  duality  by  appealing  to  duality  between 
convex  functions.  Consider  the  linear  program 

minimize $c^T x$ subject to $Ax + b = 0$, $x \ge 0$

We can eliminate the equality constraints by adding a vector $y$ of Lagrange
multipliers. So, solving the linear program is equivalent to finding

$$\min_x \max_y \left( [c^T x + y^T (Ax + b)] + \delta(x \mid x \ge 0) \right) \qquad (4.1)$$

The notation $\delta(x \mid x \ge 0)$ is defined in Chapter 3; it stands for the function which
is zero if $x \ge 0$ and $\infty$ otherwise. The expression in square brackets is called
the Lagrangian of the linear program. If the program has a finite solution, then
we may interchange the order of minimization and maximization to get

$$\begin{aligned}
& \max_y \min_x \left( [c^T x + y^T (Ax + b)] + \delta(x \mid x \ge 0) \right) \\
&\quad = -\min_y \max_x \left[ -c^T x - y^T A x - y^T b - \delta(x \mid x \ge 0) \right] \\
&\quad = -\min_y \left( -y^T b + \max_x \left[ -x^T (A^T y + c) - \delta(x \mid x \ge 0) \right] \right) \\
&\quad = -\min_y \left[ -y^T b + (\delta(x \mid x \ge 0))^* (-(A^T y + c)) \right] \\
&\quad = \max_y \left[ y^T b - \delta(y \mid A^T y + c \ge 0) \right]
\end{aligned}$$

In  other  words,  we  may  find  the  optimal  objective  value  for  our  original  linear 
program  by  solving  the  new  linear  program 

maximize $b^T y$ subject to $A^T y + c \ge 0$

We define this new linear program to be the dual of our original program. If we
replace $A^T y + c \ge 0$ by $A^T y + c = z \ge 0$ and then apply the same sequence
of transformations, it is easy to verify that the result is equivalent to the primal
program.
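
As a short supplementary check (our own worked step, not part of the derivation above), weak duality for this primal and dual pair follows directly from the two sets of constraints. Let $x$ be primal feasible, so that $Ax + b = 0$ and $x \ge 0$, and let $y$ be dual feasible, so that $A^T y + c \ge 0$. Then

```latex
\begin{aligned}
c^T x - b^T y &= c^T x + (Ax)^T y && \text{since } b = -Ax \\
              &= x^T (A^T y + c) \\
              &\ge 0           && \text{since } x \ge 0 \text{ and } A^T y + c \ge 0
\end{aligned}
```

So every dual feasible objective value $b^T y$ is a lower bound on every primal feasible objective value $c^T x$, and the two are equal exactly when $x_i (A^T y + c)_i = 0$ for every $i$, which is the complementary slackness condition.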

4.2.2  LPs  and  convex  duality 

When thinking about duality between linear programs, it is often useful to
remember the specialization of the theory of convex duality to indicator functions.
As defined in Section 3.2, the indicator function for a convex set $C$ is

$$\delta(x \mid C) = \begin{cases} 0 & x \in C \\ \infty & \text{otherwise} \end{cases}$$

This function is zero inside of $C$ and infinite outside of $C$, so if we want to
constrain the variable $x$ in a minimization problem to be in the set $C$ we can
add $\delta(x \mid C)$ to the function to be minimized.

The simplest convex sets $C$ are the cones. A cone is the set of all nonnegative
linear combinations of a set of vectors $g_i$ called its generators. If we write $G$ for
the matrix with columns $g_i$, then $C = \{G\lambda \mid \lambda \ge 0\}$. Some examples of cones are
the origin (generated by the empty set of generators), any linear subspace, and
the two cones shown in Figure 4.2. If the set of generators is finite, then $C$ is a
closed convex set.

Figure 4.2: Two cones. Heavy lines show a set of generators for one of the cones.

The polar of a cone $C$, written $C^\circ$, is the set of vectors which make either a
right or an obtuse angle with every vector in $C$. That is,

$$C^\circ = \{x \mid (\forall y \in C)\; x^T y \le 0\}$$

The polar is always a closed cone. For closed cones, the operation of taking the
polar is its own inverse: the polar of the polar of a closed cone is the cone itself.
The two cones in Figure 4.2 are polar to each other. As the figure shows, the
extreme vectors of a cone are the face normals of its polar. Polarity between
cones is an example of duality between convex functions: if $C$ is a cone, then
the dual of the indicator function $\delta(x \mid C)$ is $\delta(x \mid C^\circ)$.
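
For a finitely generated cone, membership in the polar reduces to a finite test: since every $y \in C$ is a nonnegative combination of the generators, $x \in C^\circ$ exactly when $x^T g_i \le 0$ for each generator $g_i$. The sketch below illustrates this on a two-dimensional cone of our own choosing (the generators are an assumed example, not the cones of Figure 4.2), and checks that taking the polar twice recovers the original generators.

```python
def in_polar(generators, x, tol=1e-12):
    # x is in the polar cone iff it makes a right or obtuse angle
    # with every generator of the original cone.
    return all(sum(gi * xi for gi, xi in zip(g, x)) <= tol for g in generators)

# Assumed example: the cone between the directions (1, 0) and (1, 1).
C_GENERATORS = [(1.0, 0.0), (1.0, 1.0)]
# Outward normals of the two faces of C; these generate the polar cone.
POLAR_GENERATORS = [(0.0, -1.0), (-1.0, 1.0)]

print([in_polar(C_GENERATORS, p) for p in POLAR_GENERATORS])
print(in_polar(C_GENERATORS, (1.0, 0.0)))
print([in_polar(POLAR_GENERATORS, g) for g in C_GENERATORS])
```

The generators of the polar are the face normals of the original cone, which is the relationship between extreme vectors and face normals described above.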

We can represent an arbitrary convex set in $\mathbb{R}^n$ as the intersection of a cone
in $\mathbb{R}^{n+1}$ with a fixed plane. For example, Figure 4.3 shows the representation
of a triangle in $\mathbb{R}^2$ as the intersection of a cone and a plane in $\mathbb{R}^3$. Usually
we will use the same coordinate system for $\mathbb{R}^{n+1}$ as we did for $\mathbb{R}^n$, with the
addition of one extra coordinate (call it $t$). We can then identify $\mathbb{R}^n$ with the
plane $t = 1$ in $\mathbb{R}^{n+1}$, so that we can represent the convex set $C$ by the cone
$\{(tx, t) \mid t \ge 0, x \in C\}$. This cone is called the homogeneous representation
of $C$. If $C$ is an affine set, then we can use either the regular homogeneous
representation or the set $\{(tx, t) \mid t \in \mathbb{R}, x \in C\}$ called the affine homogeneous
representation of $C$.

Figure 4.3: Homogeneous representation of a triangle.

We can now see that the familiar notion of geometric duality is a consequence
of polarity between convex cones. Two affine subspaces $C$ and $D$ are defined to
be geometrically dual if $x \cdot y = 1$ for all $x \in C$ and $y \in D$, while two arbitrary
convex sets are defined to be geometrically dual if $x \cdot y \le 1$ for $x \in C$ and $y \in D$.
For example, the line $a \cdot x = 1$ and the point $a$ are dual affine subspaces in $\mathbb{R}^2$,
while the unit cube and the unit octahedron are dual convex sets in $\mathbb{R}^3$. If $C$
and $D$ are geometrically dual convex sets, then the homogeneous representation
of $C$ is the polar of the homogeneous representation of $D$, reflected along the $t$
axis. If $C$ and $D$ are geometrically dual affine sets, then the affine homogeneous
representation of $C$ is the polar of the affine homogeneous representation of $D$,
reflected along the $t$ axis.

We can take advantage of the connection between convex duality and cone
polarity to analyze how operations on a cone change its polar. For example,
intersection of two cones corresponds to addition of their indicator functions.
The dual operation for addition is infimal convolution, defined as

$$(f \,\square\, g)(x) = \inf\{f(a) + g(b) \mid a + b = x\}$$

If $f$ and $g$ are indicator functions for the convex cones $C$ and $D$, then $f(a) + g(b)$
will be zero if $a \in C$ and $b \in D$, and infinite otherwise. So, $(f \,\square\, g)(x)$ will be
zero iff $x \in C + D$. That means that the polar of $C^\circ \cap D^\circ$ (corresponding to
$f^* + g^*$) is $C + D$ (corresponding to $f \,\square\, g$); in other words, the operations of
intersection and set sum are polar to each other.

For another example, if $C$ and $D$ are the homogeneous representations of
convex sets, then we can write the intersection of $(C \cap D)^\circ$ with the plane $t = -1$
as the union of

$$\lambda (C^\circ \cap (t = -1)) + (1 - \lambda)(D^\circ \cap (t = -1))$$

for $\lambda \in [0, 1]$. In other words, the geometric dual of the intersection of two
convex sets is the convex hull of the geometric duals of the sets.

The connection between a cone and its polar can help us understand the
connection between a linear program and its dual. The relationship between a
linear program and its dual is clearest for the degenerate linear program

find $x \ne 0$ such that $x \ge 0$, $Ax = 0$ (4.2)

This problem, called the homogeneous linear inequality problem, can be thought
of as a linear program whose constant vectors are both zero. It is equivalent to
asking whether there is an $x \ne 0$ for which the convex function

$$\delta(x \mid Q) + \delta(x \mid Ax = 0) \qquad (4.3)$$

is zero, where $Q$ is the nonnegative orthant. The dual problem to (4.2) is

find $x \ne 0$, $y$ such that $x \le 0$, $x = A^T y$

which corresponds to the question of whether there is an $x \ne 0$ for which the
convex function

$$\delta(x \mid -Q) + \delta(x \mid x = A^T y) \qquad (4.4)$$

vanishes.

The expressions (4.3) and (4.4) are almost, but not quite, convex duals of
each other. The dual of $\delta(x \mid Q)$ is $\delta(x \mid -Q)$, while the dual of $\delta(x \mid Ax = 0)$ is
$\delta(x \mid x = A^T y)$. But the dual operation to addition is infimal convolution, so the
convex dual of (4.3) is

$$\delta(x \mid -Q) \,\square\, \delta(x \mid x = A^T y)$$

which is the indicator function of the set $-Q + \{x \mid x = A^T y\}$. In other words,
there are four different convex sets associated with the system of inequalities
(4.2): the intersection of the positive orthant with the linear constraint
set, the sum of the positive orthant and the constraint set, the intersection of
the dual of the positive orthant with the dual of the constraint set, and the sum
of the dual of the positive orthant and the dual of the constraint set. Two of
these four sets are the feasible regions for (4.2) and its dual, while the other two
are the polars of the feasible regions.

For general linear programs the situation is similar: the difference is that
instead of the indicator $\delta(x \mid Ax = 0)$ we have the function $\delta(x \mid Ax + b = 0) + c^T x$,
which is not an indicator function. (It is called a partial affine function, since
its domain is an affine space and it is a linear function on its domain.) Still,
we can construct four different convex functions by applying either addition or
infimal convolution to either the indicators of $Q$ and the partial affine function
or their duals. Two of these functions represent the feasible region and objective
function for the linear program and its dual.

4.2.3  The  dual  Bellman  equations 

The  dual  of  the  Bellman  equation  linear  program  is 

minimize $c^T f$

subject to $E^T f + s = 0$

$f \ge 0$

minimize $f_{xy} + 2 f_{xz} + f_{yg} + f_{zg}$ subject to

$$\begin{aligned}
-f_{xy} - f_{xz} + 1 &= 0 \\
f_{xy} - f_{yg} + 1 &= 0 \\
f_{xz} - f_{zg} + 1 &= 0 \\
f_{yg} + f_{zg} - f_g + 1 &= 0 \\
f_{xy}, f_{xz}, f_{yg}, f_{zg}, f_g &\ge 0
\end{aligned}$$

Figure 4.4: The dual of the Bellman program.

This linear program has one equality constraint (that is, one row of $E^T$) for
each state of our MDP, and one variable (that is, one column of $E^T$) for each
edge or state-action pair of our MDP. The equality constraint for state $y$ is

$$\sum_{x,a} \gamma P_{axy} f_{xa} + 1 = \sum_a f_{ya}$$

We can interpret $f_{xa}$ as the expected number of times we visit the edge $(x, a)$ if
we follow one trajectory starting from each state. (If there is a discount factor,
then $f_{xa}$ is the expected discounted frequency.) We will call $f_{xa}$ the flow along
edge $(x, a)$. Under this interpretation, the equality constraint for state $y$ tells
us that we must enter $y$ exactly as often as we leave it. Since the objective
function $f^T c$ is equal to the expected cost of visiting the edge $(x, a)$ a total of
$f_{xa}$ times, the dual Bellman program tells us to minimize the total expected
cost of following one trajectory starting from each state.

Clearly it is not necessary to start exactly once in each state. If $s$ is a vector
of positive starting frequencies, so that we start $s_x > 0$ times in state $x$, then
the equality constraint for state $y$ becomes

$$\sum_{x,a} \gamma P_{axy} f_{xa} + s_y = \sum_a f_{ya}$$


The optimal vector of flows may be different for different choices of $s$, but the
linear program will be feasible for any choice of $s > 0$. The fact that any positive
vector  of  starting  frequencies  produces  a  feasible  dual  program  is  equivalent  to 
the  fact  that  any  positive  objective  vector  produces  a  bounded  primal  program. 

Figure 4.4 shows the dual Bellman equation program for our example MDP
from Figure 4.1. The optimal solution to this program is $(f_{xy}, f_{xz}, f_{yg}, f_{zg}, f_g) =
(1, 0, 2, 1, 4)$. Just as with the primal program, if we know the optimal $f$ we can
find the best edge out of any state; any edge with positive flow will do.
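
The relationship between the primal and dual Bellman programs can be checked directly on the example. The sketch below is our own illustration: the matrix `E` and the vectors `c`, `s`, `v`, and `f` are transcribed from the example MDP as we read it, with one row of `E` per edge plus a final row for the goal base case. It confirms flow conservation $E^T f + s = 0$, equal objective values, and complementary slackness.

```python
# Rows: edges x->y, x->z, y->g, z->g, and the base-case row for g.
# Columns: states x, y, z, g.
E = [
    [-1, 1, 0, 0],
    [-1, 0, 1, 0],
    [0, -1, 0, 1],
    [0, 0, -1, 1],
    [0, 0, 0, -1],
]
c = [1, 2, 1, 1, 0]      # edge costs (0 for the base-case row)
s = [1, 1, 1, 1]         # one trajectory started from each state
v = [2, 1, 1, 0]         # primal solution (state values)
f = [1, 0, 2, 1, 4]      # dual solution (edge flows)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Primal slacks Ev + c, one per edge.
slack = [dot(row, v) + ci for row, ci in zip(E, c)]
# Flow conservation E^T f + s, one entry per state.
conservation = [dot([row[j] for row in E], f) + s[j] for j in range(len(s))]

print(slack)
print(conservation)
print(dot(s, v), dot(c, f))
print([sl * fl for sl, fl in zip(slack, f)])
```

Every edge carries either zero slack or zero flow, so the product in the last line is zero throughout; this is the complementary slackness property that links the optimal policy to the optimal basis.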

4.3 Incremental computation


The  previous  sections  describe  how  to  convert  a  Markov  decision  process  to  a 
linear  program.  This  transformation  provides  a  simple  algorithm  for  finding  the 
value  function  of  a  known  MDP:  convert  it  to  a  linear  program,  then  solve  the 
linear  program  with,  say,  simplex  or  a  logarithmic-barrier  method.  For  some 
benchmarks  of  this  algorithm  versus  value  iteration,  see  [TZ93,  TZ95,  TZ97]. 

Often,  though,  we  don’t  know  the  entire  MDP  in  advance;  or,  even  if  we  do 
know  it,  it  is  so  large  that  we  can’t  afford  to  examine  every  state  even  once.  In 
either  of  these  cases,  we  need  an  incremental  version  of  the  above  algorithm. 
That  is,  we  need  to  be  able  to  convert  a  partly-known  MDP  into  a  linear  program 
in  such  a  way  that  when  we  solve  the  LP  we  end  up  with  something  close  to  the 
correct  value  function. 

Incremental  computation  often  goes  hand  in  hand  with  approximation:  if  our 
MDP  is  so  large  that  we  need  to  look  at  it  bit  by  bit,  then  we  will  often  also  need 
to  use  a  compact  representation  for  its  value  function.  For  this  section,  though, 
we  will  just  worry  about  incremental  computation,  and  leave  approximation  for 
Section  4.6.  In  other  words,  we  will  suppose  that  our  MDP  is  small  enough  that 
we  could  solve  it  exactly  if  we  knew  it,  but  that  we  are  finding  out  about  it  bit 
by  bit. 

There are at least two natural orders in which to reveal an MDP one piece
at a time: edge by edge or state by state. Since every edge corresponds to a row
of the adjacency matrix $E$, and since every state corresponds to a column of $E$,
these two orders correspond to revealing $E$ row by row or column by column.

We can represent either of these two orders, and many more, by writing $E_t$,
$c_t$, and $s_t$ for our best approximations to $E$, $c$, and $s$ at time $t$. For example,
if we are finding out about our MDP edge by edge, then $E_{t+1} - E_t$ will have
nonzero entries in exactly one row.

With this notation, it is natural to suppose that the sequences $f_t$ and $v_t$
defined by the linear programs

$$\text{minimize } c_t^T f_t \quad \text{subject to } E_t^T f_t + s_t = 0,\ f_t \ge 0$$

$$\text{maximize } s_t^T v_t \quad \text{subject to } E_t v_t + c_t \ge 0$$

might be good approximations to the optimal flows and values $f$ and $v$ respectively.
Unfortunately, $f_t$ and $v_t$ do not necessarily converge to $f$ and $v$ even if
$E_t \to E$, $c_t \to c$, and $s_t \to s$. For example, a small change in $c_t$ can cause a
discontinuous jump in $f_t$ if it causes the solution of the flow program to move
from one corner of the feasible region to an adjacent one.
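
The jump is easy to reproduce on a tiny flow program. The sketch below assumes a hypothetical two-edge program (minimize $c^T f$ subject to $f_1 + f_2 = 1$, $f \ge 0$); the name `hard_flow` is illustrative and does not come from the text.

```python
def hard_flow(c):
    """Solve: minimize c.f subject to sum(f) = 1, f >= 0.

    The optimum sits at a corner: all flow goes to a cheapest edge
    (ties are broken by taking the lowest index).
    """
    best = min(range(len(c)), key=lambda i: c[i])
    return [1.0 if i == best else 0.0 for i in range(len(c))]


eps = 1e-9
# Two cost vectors that differ by only eps in each component...
f_before = hard_flow([1.0, 1.0 + eps])
f_after = hard_flow([1.0 + eps, 1.0])
# ...yet the optimal flows sit at opposite corners of the feasible region.
print(f_before, f_after)
```

However small `eps` is made, the two flow vectors stay a fixed distance apart, so a sequence $c_t$ that oscillates around a tie never yields a convergent $f_t$.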

There are, however, some convergence results that do hold under mild conditions
if $E_t \to E$, $c_t \to c$, and $s_t \to s$ as $t \to \infty$. For example, if the primal
and dual feasible regions are bounded and the primal and dual optima are not
degenerate, then $f_t \to f$ and $v_t \to v$.

Figure 4.5: A Markov decision process with just one state.

4.4  Soft  constraints 

Consider the MDP shown in Figure 4.5. It has just one state; from this state
the agent may choose any of $k$ actions, with costs $c(1), \ldots, c(k)$, each of which
ends the trajectory. The primal and dual linear programs for this MDP are

$$\text{maximize } v \quad \text{subject to } v \le c(1),\ v \le c(2),\ \ldots,\ v \le c(k)$$

$$\text{minimize } c^T f \quad \text{subject to } \sum_i f_i = 1,\ f \ge 0$$

where $c$ is the vector with elements $c(1), \ldots, c(k)$. If $c(i)$ is the smallest element
of $c$, then the solution to the value program is $v = c(i)$, while the solution to
the flow program is a vector $f$ with a 1 in the $i$th position and zeros elsewhere.

Let $c_1, c_2, \ldots$ be a sequence of vectors converging to $c$, and let $v_t$ and $f_t$ be
the solutions to the linear programs that result from replacing $c$ with $c_t$ in the
value and flow programs above. Then $v_t$ will converge to $v$ and $f_t$ will converge
to $f$ as long as there is a unique smallest element of $c$.

Unfortunately, though, $v_t$ may not be the best estimate of $v$ given $c_t$. As
pointed out in [TS93], if the elements of $c_t - c$ are random variables with zero
mean, then $v_t$ will tend to underestimate $v$. The reason for this behavior is that
the errors in the components of $c_t$ can cause the smallest element of $c_t$ to have
a different index than the smallest element of $c$. The underestimation will be
most pronounced if there are several elements of $c$ that have almost the same
value as the smallest element.
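
A small simulation makes the bias visible. This is a minimal sketch under assumed conditions that are not in the text: a hypothetical cost vector with several near-ties, independent Gaussian noise with standard deviation 0.1, and a fixed random seed.

```python
import random


def mean_min_estimate(c, sigma, trials, seed=0):
    """Average of min(c + noise) over many independent zero-mean noise draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [ci + rng.gauss(0.0, sigma) for ci in c]
        total += min(noisy)  # v_t for the one-state MDP is the smallest noisy cost
    return total / trials


c = [1.0, 1.0, 1.0, 1.05]   # several elements almost tied for the minimum
v_true = min(c)
v_avg = mean_min_estimate(c, sigma=0.1, trials=5000)
bias = v_avg - v_true       # negative: v_t underestimates v on average
print(bias)
```

Although every component of the noise has zero mean, the minimum of the noisy costs is on average below the true minimum, and the gap widens as more components come close to a tie.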

We can at least partially fix this problem by "softening" the inequality constraints
in the value program, so that $v_t$ is allowed to be slightly larger than the
smallest component of $c_t$. To do so, we will pick a penalty function $l$ and scaling
factor $\mu > 0$ and replace the constraint $v_t \le c_{t,i}$ by the penalty
$\mu\, l((v_t - c_{t,i})/\mu)$. The
idea is that $l(x)$ should be small for negative values of $x$ and large for positive
values of $x$, so that there is a penalty for making $v_t$ too much larger than the
smallest component of $c_t$. The scaling factor $\mu$ lets us specify how much uncertainty
there is in the components of $c_t$: the smaller $\mu$ is, the faster the penalty
grows as $v_t$ increases.

More precisely, let $l$ be any convex function with $l'(x) \to 0$ as $x \to -\infty$ and
$l'(x) \to \infty$ as $x \to \infty$. As in the last chapter, $l'$ stands for any subgradient of
$l$. If $\operatorname{dom} \partial l$ is not all of $\mathbb{R}$, then we extend $l'$ to $\mathbb{R}$ by taking $l'(x) = -\infty$ for $x$
to the left of $\operatorname{dom} \partial l$ and $l'(x) = +\infty$ for $x$ to the right of $\operatorname{dom} \partial l$. Given such
a penalty function $l$, we define the soft value program with parameter $\mu > 0$ to
be

$$\text{maximize } v - \mu \sum_i l\!\left(\frac{v - c(i)}{\mu}\right)$$


If we take $l(x)$ to be $\delta(x \mid x \le 0)$ then the soft value program is identical to the
value program for any $\mu$. Usually, though, we will take $l(x)$ to be a function
that approaches its limits more gradually, say $l(x) = e^x$. In this case the value
of $\mu$ controls how hard or soft the constraints are: smaller values of $\mu$ result in
harder constraints. In fact, under mild conditions the solution to the soft value
program will approach the solution to the original value program as $\mu \to 0$.
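
For the exponential penalty $l(x) = e^x$ the one-state soft value program can be solved in closed form, which makes the limit $\mu \to 0$ easy to check numerically. The sketch below assumes the program has the form maximize $v - \mu \sum_i \exp((v - c(i))/\mu)$; the function name and the cost vector are illustrative.

```python
import math


def soft_value(c, mu):
    """Maximizer of v - mu * sum_i exp((v - c_i) / mu).

    Setting the derivative to zero gives sum_i exp((v - c_i) / mu) = 1,
    that is, v = -mu * log(sum_i exp(-c_i / mu)).
    """
    m = min(c)  # shift by the minimum so the exponentials cannot overflow
    return m - mu * math.log(sum(math.exp(-(ci - m) / mu) for ci in c))


c = [3.0, 1.0, 2.0]
for mu in (1.0, 0.1, 0.01):
    # As mu shrinks, the soft value approaches the hard value min(c) = 1.
    print(mu, soft_value(c, mu))
```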
The dual of the soft value program is the soft flow program

$$\text{minimize } c^T f + \mu \sum_i l^*(f_i) \quad \text{subject to } \sum_i f_i = 1$$


The terms $l^*(f_i)$ serve as barriers to push the elements of $f$ away from zero,
so the constraint $f \ge 0$ is no longer necessary. (Because of this fact, $\mu$ is
usually called the barrier parameter.) For example, if $l(x) = \delta(x \mid x \le 0)$ then
$l^*(x) = \delta(x \mid x \ge 0)$; or if $l(x) = e^x$ then $l^*(x) = x \ln x - x$. More generally, since
$l'(x) \to 0$ as $x \to -\infty$, $l^*(x)$ will be equal to $\infty$ for $x < 0$, and since $l'(x) \to \infty$
as $x \to \infty$, $l^*(x)$ will be finite for any positive $x$.

The barrier terms tend to push the components of $f$ away from zero, while
positive components of $c$ tend to push the corresponding components of $f$ towards
zero. So, the largest component of $f$ will correspond to the smallest
component of $c$. The larger $\mu$ is, the closer $f$ will be to the uniform distribution,
and the smaller $\mu$ is, the more $f$ will concentrate its weight on the smallest
components of $c$. In fact, just as with the soft value program, under mild conditions
the solution to the soft flow program approaches the solution to the
original flow program as $\mu \to 0$.
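
With the entropy barrier $l^*(x) = x \ln x - x$, the one-state soft flow program has the explicit solution $f_i \propto \exp(-c(i)/\mu)$, which shows both limiting behaviours directly. The following is a sketch with illustrative names and an arbitrary cost vector, not code from the thesis.

```python
import math


def soft_flow(c, mu):
    """Minimizer of c.f + mu * sum_i (f_i ln f_i - f_i) subject to sum(f) = 1.

    Stationarity of the Lagrangian gives c_i + mu * ln f_i = v for every i,
    so f_i is proportional to exp(-c_i / mu).
    """
    m = min(c)  # shift by the minimum for numerical stability
    w = [math.exp(-(ci - m) / mu) for ci in c]
    z = sum(w)
    return [wi / z for wi in w]


c = [3.0, 1.0, 2.0]
f_large = soft_flow(c, 100.0)  # large mu: close to the uniform distribution
f_small = soft_flow(c, 0.01)   # small mu: nearly all weight on the cheapest edge
print(f_large)
print(f_small)
```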

Just as in the previous section, if $c_1, c_2, \ldots$ is a sequence of vectors converging
to $c$, then we can define an incremental algorithm for computing $f$ or $v$ by
substituting $c_t$ for $c$ in the flow or value programs. (Because we have assumed
a particular form for the edge matrix we do not need to reveal it incrementally,
and because there is only one state we do not need to reveal the vector of
starting frequencies incrementally.) We do have to make one additional choice,
though: we must choose a sequence of barrier parameters $\mu_t$ converging to $\mu$.
(In particular, to solve the original linear program with hard constraints, we
should have $\mu_t \to 0$.) Then we can write the incremental soft flow program as

$$\text{minimize } c_t^T f_t + \mu_t \sum_i l^*(f_{t,i}) \quad \text{subject to } \sum_i f_{t,i} = 1$$

where $f_{t,i}$ is the $i$th component of $f_t$. The incremental soft value program can
be written similarly.

We have been using a simple linear program as an example, but we can
soften the constraints of an arbitrary linear program in the same way. To the
linear program

$$\text{minimize } c^T f \quad \text{subject to } E^T f + s = 0,\ f \ge 0 \qquad (4.5)$$

corresponds the softened program

$$\text{minimize } c^T f + \mu \sum_i l^*(f_i) \quad \text{subject to } E^T f + s = 0 \qquad (4.6)$$

with its dual

$$\text{maximize } s^T v - \mu \sum_i l\!\left(-\frac{(Ev + c)_i}{\mu}\right)$$

Linear programs are invariant to scaling; that is, for any $k > 0$ the program

$$\text{minimize } k c^T f \quad \text{subject to } k E^T f + k s = 0,\ f \ge 0$$

has the same primal and dual solutions as (4.5). In order to make the soft
programs invariant to scaling, we must scale $\mu$ as well; the program

$$\text{minimize } k c^T f + k \mu \sum_i l^*(f_i) \quad \text{subject to } k E^T f + k s = 0$$

has the same primal and dual solutions as (4.6).


4.5  A  statistical  interpretation 

While there are many possible choices for the penalty function $l$ in the soft
value and flow programs, picking $l(x) = e^x$ for the MDP of Figure 4.5 results
in a familiar algorithm. Since $l^*(x) = x \ln x - x$, the soft flow program can be
written

$$\text{minimize } c^T f + \mu \sum_i \left( f_i \ln f_i - f_i \right) \quad \text{subject to } \sum_i f_i = 1$$

This minimization problem is almost the same as the one that yields the WM
algorithm from the previous chapter. To complete the analogy, write $x_t$ for the
vector of expert losses on trial $t$ and let $c_t = \frac{1}{t} \sum_{i=1}^{t} x_i$. Then if we set
$\mu_t = \frac{1}{\eta t}$ we have

$$f_t = \arg\min_f \left( c_t^T f + \mu_t H(f) \right) = \arg\min_f \left( H(f) + \eta X_t^T f \right)$$

so the incremental flow programs produce the same series of predictions as the
WM algorithm.
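
The equivalence can be checked numerically. The sketch below rests on assumptions that are not confirmed by this section: that WM takes the usual multiplicative form $w_i \leftarrow w_i \exp(-\eta \cdot \text{loss}_i)$, that $c_t$ is the average loss over the first $t$ trials, and that $\mu_t = 1/(\eta t)$. All function names and the loss table are hypothetical.

```python
import math


def soft_flow(c, mu):
    """Entropy-barrier soft flow for the one-state MDP: f_i ~ exp(-c_i / mu)."""
    m = min(c)
    w = [math.exp(-(ci - m) / mu) for ci in c]
    z = sum(w)
    return [wi / z for wi in w]


def exponential_weights(losses, eta):
    """Assumed form of WM: multiply each weight by exp(-eta * loss), then normalize."""
    w = [1.0] * len(losses[0])
    for x in losses:
        w = [wi * math.exp(-eta * xi) for wi, xi in zip(w, x)]
    z = sum(w)
    return [wi / z for wi in w]


# Hypothetical losses of three experts over four trials.
losses = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
eta = 0.5
t = len(losses)

c_t = [sum(x[i] for x in losses) / t for i in range(3)]  # average loss per expert
f_t = soft_flow(c_t, 1.0 / (eta * t))                    # barrier parameter 1/(eta*t)
w_t = exponential_weights(losses, eta)
print(f_t)
print(w_t)
```

Both computations give weights proportional to $\exp(-\eta \sum_i x_i)$, because dividing the cumulative loss by $t$ and the barrier weight by the same factor leaves the minimizer unchanged.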

These predictions also have a simple Bayesian interpretation. Let us suppose
that the experts are predicting a sequence of binary random variables $y_t$. Suppose
also that there is a single best expert, so that the true outcome is always
equal to the best expert's prediction plus some random noise. Our task is then
to distinguish between the $n$ statistical models "expert $i$ is best." Write $p_{1,i}$ for
the prior probability that expert $i$ is best, and $p_{t,i}$ for the posterior after seeing
the first $t - 1$ examples. Write $p(y \mid \hat{y})$ for the probability of outcome $y$ given that
the best expert predicts $\hat{y}$. Then we can compute the posterior probabilities of
our models with Bayes' rule:

$$p_{t+1,i} \propto p_{t,i}\, p(y_t \mid \hat{y}_{t,i})$$

So, if we initialize $X_{1,i} = -\ln p_{1,i}$ and update

$$X_{t+1,i} = X_{t,i} + x_{t,i}$$

where $x_{t,i} = -\ln p(y_t \mid \hat{y}_{t,i})$, then the posterior probabilities at each time step
are just $p_{t,i} = e^{-X_{t,i}} / \sum_j e^{-X_{t,j}}$, which are the same as the predictions
of the WM algorithm with learning rate $\eta = 1$.
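
The two bookkeeping schemes can be compared directly. This is a minimal sketch with a made-up prior and made-up likelihood values; `likelihoods[t][i]` stands for $p(y_t \mid \hat{y}_{t,i})$, and both function names are illustrative.

```python
import math


def bayes_posterior(prior, likelihoods):
    """Apply Bayes' rule once per observation, renormalizing each time."""
    p = list(prior)
    for lik in likelihoods:
        p = [pi * li for pi, li in zip(p, lik)]
        z = sum(p)
        p = [pi / z for pi in p]
    return p


def loss_posterior(prior, likelihoods):
    """Accumulate log losses X_i, then normalize exp(-X_i)."""
    X = [-math.log(pi) for pi in prior]                    # X_{1,i} = -ln p_{1,i}
    for lik in likelihoods:
        X = [Xi - math.log(li) for Xi, li in zip(X, lik)]  # add x_{t,i}
    z = sum(math.exp(-Xi) for Xi in X)
    return [math.exp(-Xi) / z for Xi in X]


prior = [0.5, 0.3, 0.2]
likelihoods = [[0.9, 0.1, 0.9], [0.9, 0.9, 0.1], [0.1, 0.9, 0.9]]
p_bayes = bayes_posterior(prior, likelihoods)
p_loss = loss_posterior(prior, likelihoods)
print(p_bayes)
print(p_loss)
```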

More  generally,  we  can  interpret  the  soft  value  and  flow  programs  for  arbi¬ 
trary  Markov  decision  processes  (or  in  fact  any  primal  and  dual  pair  of  softened 
linear  programs)  as  statistical  estimation  problems.  The  remainder  of  this  sec¬ 
tion  explores  this  connection  in  more  detail. 

4.5.1 Maximum Likelihood in Exponential Families

One of the simplest statistical inference methods is maximum likelihood: given a
family of probability distributions, pick the one which maximizes the probability
of an observed sample. More formally, suppose we have a set $X$ of possible
outcomes. (We will assume $X$ is finite, but much of the following carries over
to infinite sets of outcomes.) Write $\tilde{f}_x$ for the normalized frequency of outcome
$x \in X$ in the observed sample. Suppose that our family of distributions is
indexed by a parameter vector $\theta$, and write $f_x(\theta)$ for the predicted probability
of outcome $x$ given $\theta$. Then the maximum likelihood problem is to find

$$\arg\max_\theta \sum_{x \in X} \tilde{f}_x \ln f_x(\theta) \qquad (4.7)$$

that is, to find the $\theta$ which maximizes the log-likelihood of the observed sample.

Often the distributions $f(\theta)$ will form an exponential family, that is, a set
of distributions for which we can write

$$f_x(\theta) = \exp(t_x^T \theta + h_x + g(\theta)) \qquad (4.8)$$

Many well-known sets of distributions are exponential families, for example the
normal, gamma, exponential, chi-squared, Dirichlet, multinomial, and Poisson
families.

In Equation 4.8 the vectors $t_x$ and scalars $h_x$, one for each possible outcome
$x \in X$, together define the exponential family. The function $g(\theta)$ is called the
cumulant generating function, and it is determined by the requirement

$$\sum_{x \in X} f_x(\theta) = 1 \qquad (4.9)$$

We can interpret each component of $t_x$ as a relevant feature or statistic about
outcome $x$. For example, if $X$ is a set of real numbers, we can associate the
features $x$ and $x^2$ with outcome $x$. Then we can set $h_x = 0$, and the result will
be a family of discrete normal distributions. The constants $h_x$ allow us to define
subfamilies: for example, we can define a family with fixed variance by setting
$h_x$ to a multiple of $x^2$ and using just the single feature $x$.
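
The discrete normal example can be written out concretely. The sketch below assumes a small hypothetical outcome set $\{-2, \ldots, 2\}$ and computes $g(\theta)$ from the normalization requirement (4.9); the helper name `family_probs` is illustrative.

```python
import math


def family_probs(outcomes, features, h, theta):
    """Return (probabilities, g) for f_x = exp(t_x . theta + h_x + g).

    g is fixed by requiring the probabilities to sum to one, so
    g = -log(sum_x exp(t_x . theta + h_x)).
    """
    scores = [sum(ti * th for ti, th in zip(features(x), theta)) + h(x)
              for x in outcomes]
    g = -math.log(sum(math.exp(s) for s in scores))
    return [math.exp(s + g) for s in scores], g


outcomes = [-2, -1, 0, 1, 2]
# Features (x, x^2) with h_x = 0; theta = (0, -1/2) makes f_x proportional
# to exp(-x^2 / 2), a discrete normal centred at zero.
probs, g = family_probs(outcomes, lambda x: (x, x * x), lambda x: 0.0,
                        theta=(0.0, -0.5))
print(probs, g)
```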

One reason exponential families are important is that their maximum likelihood
problems can be written as convex programs. By substituting (4.8) into
(4.7) and using (4.9) to constrain $g$ to be equal to $g(\theta)$, we can see that the
maximum likelihood problem for an exponential family is

$$\arg\max_{\theta, g} \sum_{x \in X} \tilde{f}_x (t_x^T \theta + g) \quad \text{subject to } \sum_{x \in X} \exp(t_x^T \theta + h_x + g) = 1$$

The above is not a convex program, since the equality constraint does not in
general define a convex set. However, it is equivalent to the convex program

$$\arg\max_{\theta, g} \sum_{x \in X} \tilde{f}_x (t_x^T \theta + g) - \sum_{x \in X} \exp(t_x^T \theta + h_x + g)$$

To see why, we can explicitly perform the maximization with respect to $g$ by
differentiating and setting to 0:

$$0 = \sum_{x \in X} \tilde{f}_x - \sum_{x \in X} \exp(t_x^T \theta + h_x + g)$$

$$1 = \sum_{x \in X} \exp(t_x^T \theta + h_x + g)$$

This expression is exactly the equality constraint from the maximum likelihood
problem, and substituting it back into the maximization gives us the correct
objective function.
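
To illustrate, plain gradient ascent on the unconstrained objective recovers a normalized distribution without the equality constraint ever being imposed. This is a sketch under assumptions not in the text: a hypothetical three-outcome family with the single feature $t_x = x$, $h_x = 0$, a fixed step size, and a fixed iteration count.

```python
import math


def fit_exponential_family(f_obs, t, h, steps=5000, lr=0.1):
    """Gradient ascent on sum_x f_obs[x]*(t[x]*theta + g) - sum_x exp(t[x]*theta + h[x] + g)."""
    theta, g = 0.0, 0.0
    for _ in range(steps):
        p = [math.exp(tx * theta + hx + g) for tx, hx in zip(t, h)]
        # Gradient in theta: observed feature total minus predicted feature total.
        grad_theta = (sum(fx * tx for fx, tx in zip(f_obs, t))
                      - sum(px * tx for px, tx in zip(p, t)))
        # Gradient in g: observed total mass minus predicted total mass.
        grad_g = sum(f_obs) - sum(p)
        theta += lr * grad_theta
        g += lr * grad_g
    p = [math.exp(tx * theta + hx + g) for tx, hx in zip(t, h)]
    return theta, g, p


f_obs = [0.5, 0.3, 0.2]   # normalized observed frequencies
t = [0.0, 1.0, 2.0]       # one feature per outcome: t_x = x
h = [0.0, 0.0, 0.0]
theta, g, p = fit_exponential_family(f_obs, t, h)
print(theta, g, p)
```

At convergence the fitted probabilities sum to one and their mean matches the observed mean, as the two stationarity conditions require.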

4.5.2  Maximum  Entropy  and  Duality 

We can gain some insight into the maximum likelihood problem for exponential
families by noticing that it is the convex dual to another problem, called linearly
constrained maximum entropy. As mentioned earlier, the maximum likelihood
problem for exponential families is to find

$$\arg\max_{\theta, g} \sum_{x \in X} \tilde{f}_x (t_x^T \theta + g) - \sum_{x \in X} \exp(t_x^T \theta + h_x + g)$$

We can write this problem more compactly by making two slight modifications.
First, if we redefine $t_x$ by adding an extra component at the end which is
always 1, then we can represent $g$ as the last component of $\theta$ instead of writing
it separately. Second, if we define a matrix $T$ whose rows are the feature vectors
$t_x$, we can write the problem in matrix notation. With these two modifications
the problem becomes

$$\arg\max_\theta\ \tilde{f}^T T \theta - \sum \exp(T \theta + h)$$

where for any vector $x$ the notation $\exp(x)$ means the vector whose components
are $\exp(x_i)$ and $\sum x$ means $\sum_i x_i$.

The constrained maximum entropy problem makes no reference to $\theta$ or the
exponential family. Instead it is defined over all probability distributions which
agree with $\tilde{f}$ on the expected value of each feature. Subject to these linear
constraints, we wish to find the distribution which maximizes entropy with
respect to some known distribution $q$. In other words, we want

$$\arg\min_f \sum_{x \in X} f_x \ln \frac{f_x}{q_x} \quad \text{subject to } T^T f = T^T \tilde{f} \qquad (4.10)$$

Since $\tilde{f}$ is normalized, and since we added an extra column of 1s to $T$, one of
the equality constraints in (4.10) forces $f$ to be a probability distribution.

To convert maximum entropy into maximum likelihood, we need to use a
vector of Lagrange multipliers (call it $\lambda$) to eliminate the equality constraints:

$$\arg\min_f \max_\lambda \sum_{x \in X} \left( f_x \ln f_x - f_x \ln q_x \right) + \lambda^T (T^T \tilde{f} - T^T f)$$

Then we can dualize by interchanging the order of minimization and maximization
and performing the minimization explicitly. Since the minimum must occur
at an interior point of the region $f \ge 0$, we can find it by setting derivatives to
zero:

$$0 = \frac{\partial}{\partial f_x} \left[ \sum_{x' \in X} \left( f_{x'} \ln f_{x'} - f_{x'} \ln q_{x'} \right) + \lambda^T (T^T \tilde{f} - T^T f) \right]$$

$$\phantom{0} = 1 + \ln f_x - \ln q_x - \lambda^T t_x$$

$$f_x = \exp(\lambda^T t_x + \ln q_x - 1)$$

Substituting this value of $f$ back into the optimization problem and cancelling
terms gives (note that we have performed the substitution in two stages to make
the cancellations clearer):

$$\arg\max_\lambda \sum_{x \in X} \left[ f_x (\lambda^T t_x + \ln q_x - 1) - f_x \ln q_x + \lambda^T t_x (\tilde{f}_x - f_x) \right]$$

$$= \arg\max_\lambda \sum_{x \in X} \left[ -f_x + \lambda^T t_x \tilde{f}_x \right]$$

$$= \arg\max_\lambda\ -\sum_{x \in X} \exp(t_x^T \lambda + \ln q_x - 1) + \tilde{f}^T T \lambda$$

Finally, putting $h_x = \ln q_x - 1$ and $\theta = \lambda$ completes the transformation from
maximum entropy to maximum likelihood.


4.5.3  Relationship  to  linear  programming  and  MDPs 

We can interpret the constrained maximum entropy problem (4.10) as a linear
program plus an entropy barrier term. In fact, the only differences between
Equation 4.6 with $l^*(x) = x \ln x$ and Equation 4.10 are that the barrier term in
Equation 4.10 has a fixed weight, the matrix $T$ in Equation 4.10 is required to
have a column of all ones, and the vector $\tilde{f}$ is required to sum to 1. The first
difference is not a loss of generality, since we can always rescale any soft linear
program so that the barrier term has weight 1.

The required column of ones in $T$ and the normalization of $\tilde{f}$ are a loss
of generality compared to Equation 4.6, but we can remove these restrictions
from Equation 4.10 without damaging its statistical interpretation. Allowing
an arbitrary $\tilde{f} \ge 0$ just means that $\tilde{f}$ no longer has to sum to 1; we can
interpret such an $\tilde{f}$ as encoding both a probability distribution and a sample
size. The constant column in $T$ serves to make the sample size of $f$ match
the observed sample size from $\tilde{f}$, just as any other column of $T$ serves to make
some other feature of $f$ match its observed value from $\tilde{f}$. So, a matrix $T$
without a constant column corresponds to a statistical estimation problem in
which we have not observed the sample size. While such statistical estimation
problems are unusual, they do exist. In fact, Markov decision processes are
a good example: in an MDP we observe how often trajectories start at each
state, but we do not observe how often we should visit each transition, since the
latter depends on which policy we follow. So, the sample size (that is, the total
number of transitions we visit while following an optimal policy) is just another
parameter that we can estimate from the observed data. Trying to constrain
the sample size of $f$ to match that of $\tilde{f}$ would be a mistake: for example, in a
shortest-paths MDP this constraint would prevent us from considering exactly
the policies that we want to consider, the ones that visit fewer states than our
sample trajectories do.


4.6  Introducing  approximation 

Section 4.4 discussed how we can soften the constraints in the linear program
representation of a Markov decision process. This softening combats the systematic
errors introduced by random fluctuations in our estimates of the coefficients.
The amount of softness is controlled by the barrier parameter $\mu$. As we
get better estimates of the coefficients, our goal is to reduce $\mu$ to zero.

In this section we will discuss how to introduce an approximate representation
of the value function into the linear program for a Markov decision process.
These two modifications, softening and approximation, are complementary:
approximation introduces errors into the coefficients, and we can minimize the
effects of these errors by a process related to softening.

$$\begin{array}{rrcl}
\text{maximize} & v(4) & & \\
\text{subject to} & v(0) & \le & 0 \\
 & v(1) - v(0) & \le & 1 \\
 & v(2) - v(1) & \le & 2 \\
 & v(3) - v(2) & \le & -1 \\
 & v(4) - v(3) & \le & 2
\end{array}$$

Figure 4.6: A linear program with its true solution and two approximate solutions.


4.6.1  A  first  try 

Suppose that we have decided on a particular approximate representation for
our value function, say $v = Aw$. Here $w$ is a vector of adjustable parameters and
$A$ is a matrix whose columns are a set of basis vectors for representing $v$. The
matrix $A$ will have one row for each state in our MDP and one column for each
basis vector. This notation encompasses any representation for $v$ that is linear
in its parameters, including linear or polynomial regression, splines, wavelets,
CMACs, and many others.

The simplest way to introduce this approximate representation into our linear
program is just to substitute $Aw$ for $v$ everywhere. Doing so yields the
following modification of the value program

$$\text{maximize } s^T A w \quad \text{subject to } E A w + c \ge 0 \qquad (4.11)$$

with the dual

$$\text{minimize } c^T f \quad \text{subject to } A^T (E^T f + s) = 0,\ f \ge 0$$

The solution to Equation 4.11 can be a good approximation to the true value
function $v$, particularly if the span of our basis function matrix $A$ contains a
low-error approximation to $v$. For examples of some MDPs for which this approach
works well, see [TZ93, TZ95, TZ97].

On the other hand, if the best representations in the span of $A$ have moderate
error, then the quality of the solution we find with Equation 4.11 can degrade
rapidly. For example, Figure 4.6 shows the linear program corresponding to a
simple MDP, along with two approximate solutions. The true solution is shown
as a solid line. If we substitute in the representation $v(x) = w_1 x + w_0$ we
might hope to get the approximate solution shown in long dashes. But instead,
Equation 4.11 yields the solution shown in short dashes. The reason is that
the inequality $v(3) - v(2) \le -1$ constrains the slope $w_1$ to be no greater than
$-1$. The solution in long dashes violates this constraint (its Bellman residual
along this edge is negative) and so is not feasible. More generally, if our basis
matrix $A$ has $k$ columns, the solution to Equation 4.11 will satisfy the $k$ most
restrictive constraints exactly and leave the others slack. If our approximate
representation for $v$ is inflexible enough, it is even possible that Equation 4.11
will have no solutions.
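
The Figure 4.6 example is small enough to work by hand, and the sketch below does that arithmetic. It takes the edge bounds from the linear program in the figure and assumes the affine representation $v(x) = w_1 x + w_0$ discussed above; the variable names are illustrative.

```python
# Edge bounds from Figure 4.6: v(0) <= 0 and v(i) - v(i-1) <= steps[i-1].
steps = [1, 2, -1, 2]

# Exact program: maximizing v(4) makes every constraint in the chain tight.
v_true = [0]
for d in steps:
    v_true.append(v_true[-1] + d)

# Restricted program with v(x) = w1*x + w0:
#   v(0) <= 0            becomes  w0 <= 0
#   v(i) - v(i-1) <= d   becomes  w1 <= d   for every edge
# Maximizing v(4) = 4*w1 + w0 pushes both parameters to their upper bounds,
# so the single most restrictive edge fixes the slope everywhere.
w0 = 0
w1 = min(steps)
v_approx = [w1 * x + w0 for x in range(5)]

print(v_true)    # [0, 1, 3, 2, 4]
print(v_approx)  # [0, -1, -2, -3, -4]
```

The restricted solution decreases with $x$ although the true values mostly increase, because one edge with a bound of $-1$ dictates the slope for every state.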

To see what the problem is with Equation 4.11, we can turn to the interpretation
of a linear program as a game. As Equation 4.1 shows, a linear program
is a minimax problem for a bilinear form called the Lagrangian. If the linear
program is

$$\text{minimize } c^T f \quad \text{subject to } E^T f + s = 0,\ f \ge 0$$

then the minimizing player must choose a vector $s$ and a nonnegative vector $f$,
while the maximizing player simultaneously chooses a vector $v$; then the payoff
to the maximizing player is the value of the Lagrangian

$$L(f, v) = c^T f + v^T (E^T f + s)$$

If we now substitute the approximate representation $v = Aw$ into this game, we
have restricted the actions of the maximizing player while leaving the minimizing
player untouched. In doing so we have given the minimizing player an advantage.

4.6.2 Approximating flows as well as values

To restore balance to the game, we must somehow restrict the minimizing player.
We will do so by adding a penalty term $l(f)$ to the Lagrangian. The resulting
penalized Lagrangian is

$$L_p(f, v) = c^T f + v^T (E^T f + s) + l(f)$$

Just as before, the minimizing player wants to choose $f \ge 0$ to make $L_p(f, v)$
as small as possible, while the maximizing player simultaneously chooses $v$ to
make $L_p(f, v)$ as large as possible. There are many different possible penalty
terms, each leading to a different algorithm. Depending on how we choose the
penalty, the resulting game may favor the minimizing player, the maximizing
player, or neither. We have already seen one example of a possible penalty, the
barrier term in the soft flow program. A disadvantage of using the barrier term
as our only penalty is that it is not clear how to choose the barrier parameter
$\mu$ to exactly cancel the advantage we have given to the minimizing player.

For the remainder of this chapter we will examine a different kind of penalty
term: we will restrict the minimizing player's choice of $f$ to lie in a linear
subspace. If the subspace is given as the span of the columns of the matrix
$B$, then restricting $f$ to lie in $\operatorname{span} B$ is equivalent to using the penalty term
$\delta(f \mid \operatorname{span} B)$.

The advantage of this type of penalty term is that there is a simple way
to maintain balance between the two players. If our MDP has $n$ states and $m$
edges, and if the matrix $A$ we are using to approximate the value function has
rank $k$, then we can choose $B$ to have rank $m - n + k$. That way we have taken
$n - k$ degrees of freedom away from each player. Different choices for $B$ will
result in different algorithms.

Choosing the penalty $\delta(f \mid \operatorname{span} B)$ to compensate for the approximate
representation $v = Aw$ results in the problem

$$\min_f \max_w \left( c^T f + (Aw)^T (E^T f + s) + \delta(f \mid \operatorname{span} B) \right)$$

which we can also express as the linear program

$$\text{minimize } c^T f \quad \text{subject to } A^T (E^T f + s) = 0,\ f \ge 0,\ f = Bg \qquad (4.12)$$

We will discuss algorithms for solving such a linear program in Section 4.7.


4.6.3  An  analogy 

It is instructive to consider an analogy to the problem of solving an overdetermined
system of linear equations. Suppose we have an $n \times n$ square matrix $M$
and an $n$-vector $b$ and we want to find an $x$ so that $Mx = b$. Suppose also that
$M$ is so large that we need to use the approximate representation $x = Ay$, where
$A$ is an $n \times k$ matrix of basis vectors. The system $MAy = b$ is overdetermined,
and so in general will have no solutions.

To find a reasonable value for $y$, we can write the system of equations as a
minimax problem:

$$\max_x \min_p\ p^T (Mx - b)$$

Since we have restricted the actions of the maximizing player by requiring $x = Ay$,
we need to define a penalty function $l(p)$ that restricts the actions of the
minimizing player. One common choice for $l(p)$ is the squared Euclidean length
of $p$. This choice of penalty results in the algorithm called least squares or linear
regression: since $\frac{1}{2}\|\cdot\|_2^2$ is a self-dual function,

$$\min_p \left( p^T (MAy - b) + \tfrac{1}{2}\|p\|_2^2 \right) = -\tfrac{1}{2}\|b - MAy\|_2^2$$

Another choice of penalty is the indicator function $\delta(p \mid \operatorname{span} B)$ for some $n \times k$
matrix $B$. A little algebra shows that the solution to the resulting minimax
problem satisfies the system of equations

$$B^T M A y = B^T b \qquad (4.13)$$

Equation 4.13 shows why the appropriate dimensions for $B$ are $n \times k$: if we
don't take the same number of degrees of freedom away from the minimizing
and maximizing players, Equation 4.13 will be either over- or underdetermined.

If we choose $B = MA$, then the equations in (4.13) are called the normal
equations. The solution to the normal equations is the same as the solution to
the least squares problem. The fact that we can represent linear regression in
these two different ways is a consequence of the fact that the derivative of $\frac{1}{2}\|p\|_2^2$
is the identity function; this connection is similar to the idea of a link function
described in Sections 3.6 and 3.10.

Other possible choices for $B$ include setting $B = DA$ for some diagonal
matrix of nonnegative weights $D$, and setting each column of $B$ to be a different
one of the $n$ unit vectors in $\mathbb{R}^n$. The choice $B = DA$ is not used very often,
since it is not usually any easier to implement than linear regression. Setting
$B$ to a collection of unit vectors is the same as picking $k$ of the $n$ equations in
$MAy = b$ to solve and throwing the others away. This algorithm is useful since
it requires much less computation than linear regression, although the quality
of the resulting solution may not be as good.
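
The choices of $B$ can be compared on a toy system. The sketch below assumes a hypothetical $3 \times 3$ matrix $M$, a right-hand side $b$, and a single basis vector ($k = 1$), so that Equation 4.13 reduces to one scalar equation; all names and numbers are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, x):
    return [dot(row, x) for row in M]


M = [[2.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0]]
b = [2.0, 3.0, 1.0]
A = [1.0, 1.0, 0.0]   # single basis vector, so x = A * y with scalar y
MA = matvec(M, A)     # the overdetermined system is MA * y = b


def solve_with(B):
    """Solve the scalar version of Equation 4.13: (B . MA) * y = B . b."""
    return dot(B, b) / dot(B, MA)


y_ls = solve_with(MA)                  # B = MA: normal equations (least squares)
y_pick = solve_with([0.0, 1.0, 0.0])   # B = unit vector: keep only equation 2

# The least squares residual is orthogonal to MA; the unit-vector choice
# instead satisfies its one selected equation exactly.
residual = [bi - mi * y_ls for bi, mi in zip(b, MA)]
print(y_ls, y_pick, dot(MA, residual))
```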

When  our  Markov  decision  process  is  a  Markov  process,  the  linear  program 
for  finding  the  value  function  reduces  to  a  set  of  linear  equations.  So,  we  can 
use  any  of  the  above  approximate  linear  equation  solving  algorithms  to  find  an 
approximation  to  the  value  function  of  a  Markov  process.  Chapter  5,  including 
Sections  5.3.4,  5.3.5,  and  5.4.1,  contains  a  more  detailed  comparison  of  these 
algorithms.

4.6.4  Open  problems 

The choice of $\delta(f \mid \operatorname{span} B)$ as a penalty term is not perfect. Its largest problem
is that the linear program (4.12) does not necessarily have a solution: it is
possible that restricting $f$ to be in the span of $B$ makes (4.12) infeasible, and it
is possible that restricting $v$ to be in the span of $A$ makes (4.12) unbounded.

If we know a vector $f_0 \ge 0$ which is either feasible or approximately feasible,
there is a simple trick to make sure that Equation 4.12 has a solution. If $f_0$
is exactly feasible we can replace whatever $B$ we were going to use by $DB$,
where $D$ is the diagonal matrix with entries $f_0$. Then, as long as the original $B$
could represent the vector of all ones, $DB$ can represent $f_0$. If we now ensure
that (4.12) is bounded, for example by requiring that the cost vector is positive
($c > 0$), then there will be a finite optimal solution. If, on the other hand, $f_0$ is
only approximately feasible, we can replace the starting frequencies $s$ by $-E^T f_0$.
If $f_0$ were exactly feasible then this replacement would not change the starting
frequencies, since feasibility implies $-E^T f_0 = s$. Since $f_0$ is not feasible, the
replacement will change $s$ so that $f_0$ is feasible in the modified program. Then
we can set $D$ to the diagonal matrix with entries $f_0$ and proceed as before.

Even  if  we  do  ensure  feasibility  this  way,  though,  there  is  no  guarantee  that 
any  vector  other  than  /o  is  feasible.  In  other  words,  it  may  not  be  possible  to 
evaluate  any  policy  other  than  the  one  which  generated  our  training  data. 

Another  difficulty  is  that,  while  the  most  pleasing  approximations  to  the 
value  function  have  approximately  equal  total  Bellman  error  in  the  positive 
and  negative  directions,  the  performance  of  the  greedy  policy  is  affected  in  an 
inherently  asymmetric  way  by  Bellman  errors  of  opposite  sign.  Positive  residuals 
correspond  to  states  whose  estimated  cost  is  too  low,  and  such  states  tend  to 
attract  flow,  while  negative  residuals  correspond  to  states  whose  estimated  cost 
is too high, and such states tend to repel flow. So, in the worst case, a single
large  positive  error  could  cause  the  greedy  policy  to  spend  all  of  its  time  in  one 
state,  while  a  single  large  negative  error  can  only  cause  the  greedy  policy  to 
avoid  one  state  (plus  any  states  which  are  only  reachable  through  that  state). 

Besides restricting the minimizing player to a linear subspace, there are
many  other  ways  to  choose  a  penalty  function.  For  example,  we  could  restrict 
the  minimizing  player  to  a  convex  set  such  as  a  cube  or  a  simplex  instead  of  to  a 
subspace.  Or,  we  could  remove  some  restrictions  on  the  minimizing  player  while 
adding  others;  for  example,  while  we  have  restricted  the  minimizing  player  to 
the  intersection  of  the  positive  orthant  with  the  span  of  B,  we  could  equally 
well  have  restricted  to  the  projection  of  the  positive  orthant  onto  the  span  of 
B.  Finally,  at  the  cost  of  giving  up  convexity,  we  could  restrict  the  minimizing 
player  to  a  nonlinear  subspace.  We  experimented  briefly  with  these  and  other 
approaches,  but  the  version  of  the  algorithm  given  here  is  the  one  that  seemed 
to  work  best. 

Yet  another  approach  is  suggested  by  the  correspondence  between  the  soft 
penalty  term  introduced  in  Section  4.4  and  maximum  likelihood  estimation. 
Negative Bellman residuals in an MDP program with a soft penalty term
correspond to samples in a maximum likelihood problem that have low probability
under the best model. Such samples are often called outliers, under the
assumption that they were generated by some process that we cannot model. In
maximum likelihood estimation, two possible responses to outliers are to
discard them or to add additional representational power to the model. We could
apply  these  same  principles  to  solving  MDPs  by  either  discarding  transitions 
with  large  negative  residuals  or  adding  representational  power  to  our  model  of 
the  value  function. 


4.7  Implementation 

The  previous  section  outlined  at  a  high  level  the  choices  involved  in  designing 
an  algorithm  to  approximate  the  Bellman  linear  program  for  a  Markov  decision 
process.  This  section  describes  in  more  detail  the  implementation  we  used  to 
perform  our  experiments. 

4.7.1  Overview 

There  are  several  design  decisions  that  we  had  to  make  for  our  algorithm.  The 
first  is  how  to  represent  our  knowledge  about  the  Markov  decision  process, 
including  its  dynamics,  its  goals,  and  its  starting  state  frequencies.  We  chose 
to  represent  the  MDP’s  dynamics  and  goals  with  a  list  of  the  transitions  we 
have  sampled;  so,  for  each  transition,  we  store  its  one-step  cost  and  the  feature 
vectors  for  its  starting  and  ending  states.  To  represent  the  starting  frequencies, 
we  store  our  estimate  of  the  expected  feature  vector  for  a  state  chosen  from  the 
starting  distribution. 

The second decision is what representation to use for the flows. As discussed
in  Section  4.6.2,  we  want  to  restrict  the  minimizing  player  to  a  subspace  of  the 
possible  flow  vectors  in  order  to  counterbalance  the  fact  that  we  have  restricted 
the  maximizing  player  to  a  subspace  of  the  possible  value  functions.  We  can 
represent the allowable subspace of flow vectors as the span of a matrix B.

In  our  implementation  we  use  the  following  choice  for  B.  There  is  one  row 
for  each  transition  we  have  observed.  The  first  k  columns  of  the  row  contain 
the  feature  vector  for  the  starting  state  of  the  transition.  That  means  that  the 
first k columns of B are a copy of A with some rows duplicated. The remaining
$m - n$ columns of each row contain either one or two nonzero elements, and
are used to chain together all of the actions that have the same starting state.
If rows $i_1 < i_2 < \cdots < i_j$ all start from the same state, there will be a 1 in
position $(k + \ell_1, i_1)$, a $-1$ in position $(k + \ell_1, i_2)$, a 1 in position $(k + \ell_2, i_2)$, a
$-1$ in position $(k + \ell_2, i_3)$, and so on until a $-1$ in position $(k + \ell_{j-1}, i_j)$. This
pattern of 1s and $-1$s for a single starting state takes up one fewer column than
it does rows, and so for n states it will take up n fewer columns than rows.

To understand this choice of B, consider the example of an MDP with exactly
three actions from each state. If we sort the transitions by action, then by state
within action, B will have the block representation

$$\begin{pmatrix} A & I & 0 \\ A & -I & I \\ A & 0 & -I \end{pmatrix} \qquad (4.14)$$

In this example, as in general, if we write $f = Bg$ then the first k components
of g assign flow equally to all actions with the same starting state, while the
remaining $m - n$ components of g move flow around between actions with the
same starting state. As we can see from the example in (4.14), the last $m - n$
columns of B are very sparse; so, since an $m \times (m - n)$ matrix is expensive to
represent, we will store only the nonzero components of these columns of B.
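As an illustration of the chaining pattern, the following sketch builds a dense copy of B from the feature matrix and the list of starting states. It is a minimal sketch, not the sparse representation described in the text: the name `build_flow_basis` and the dense list-of-rows layout are assumptions made for readability, and the chaining columns are grouped by state, so they appear as a column permutation of the blocks in (4.14).

```python
def build_flow_basis(A, start_state):
    """Dense illustration of B.

    A: list of feature rows, one per visited state.
    start_state: start_state[t] is the row of A for the state where
                 transition t begins.
    Returns one row per transition: k feature columns followed by the
    chaining columns (one fewer per state than that state has transitions).
    """
    m, k = len(start_state), len(A[0])
    n_start = len(set(start_state))
    rows = [list(A[s]) + [0.0] * (m - n_start) for s in start_state]

    # Group transition indices by starting state, preserving their order.
    by_state = {}
    for t, s in enumerate(start_state):
        by_state.setdefault(s, []).append(t)

    col = k
    for s in sorted(by_state):
        ts = by_state[s]
        # Link consecutive transitions from the same state with a +1 / -1 pair.
        for upper, lower in zip(ts, ts[1:]):
            rows[upper][col] = 1.0
            rows[lower][col] = -1.0
            col += 1
    return rows


# Two states, three actions each, one feature; transitions sorted by action, then state.
B = build_flow_basis([[1.0], [2.0]], [0, 1, 0, 1, 0, 1])
```

Each chaining column sums to zero, so moving weight along it shifts flow between two actions of one state without changing that state's total outflow.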

The  final  decision  is  whether  to  apply  the  trick  described  in  Section  4.6.4  to 
make  sure  that  the  linear  program  is  feasible.  We  decided  not  to  do  so,  since  we 
wanted  to  include  information  about  transitions  that  we  did  not  follow.  Under 
the  scheme  of  Section  4.6.4,  such  transitions  would  receive  zero  weight  and  so 
would convey no information. We did not observe any problems with infeasibility,
but it could still be that reweighting in this way would have improved our
learning  performance. 

The  next  section  describes  our  implementation  in  more  detail. 


4.7.2  Details 

The  input  to  our  program  is  a  description  of  the  transitions  we  have  sampled 
from  the  Markov  decision  process  and  the  features  we  plan  to  use  to  approximate 
the  value  function.  More  specifically,  if  we  have  seen  m  transitions  from  n  states 
and  we  have  k  features,  the  input  will  comprise  the  following  objects  (described 
in more detail below):

• An $n \times k$ dense matrix A.

• An $m \times k$ dense matrix EA.

• An $(m - n + k) \times m$ sparse matrix B.

• An m-vector c.

• A k-vector $A^{\mathsf T} s$.

By a dense matrix we mean one where we represent every element explicitly,
while  by  a  sparse  matrix  we  mean  one  where  we  represent  only  the  nonzero 
elements  to  save  space.  The  output  of  our  program  is  a  vector  of  parameters  w 
representing  our  learned  value  function. 

The  columns  of  the  matrix  A  are  the  basis  functions  we  intend  to  use  to 
represent  the  value  function.  In  other  words,  at  the  end  of  the  algorithm,  Aw  is 
our  best  estimate  of  the  true  value  function.  To  save  space,  we  do  not  represent 
the  rows  of  A  that  correspond  to  states  which  we  have  not  visited.  Each  row  of 
A  contains  the  values  of  our  k  features  or  basis  functions  at  a  single  state.  For 
example, if our observed states were the real numbers $x_1, x_2, \ldots$ and we wanted
a quadratic approximation to the value function, then the rows of A would be
$(1, x_1, x_1^2), (1, x_2, x_2^2), \ldots$.

The  matrix  EA  is  our  best  estimate  of  the  product  of  the  edge  matrix  E  with 
the  basis  matrix  A.  To  save  space,  we  remove  from  E  the  columns  corresponding 
to  states  we  have  not  visited  and  the  rows  corresponding  to  transitions  we 
have  not  visited.  So,  each  row  of  E  corresponds  to  a  single  transition  we  have 
observed: if we observe a transition from state i to state j then the corresponding
row of E will have a $-1$ in the ith column and a $\gamma$ in the jth column. If we
know not just a single destination state but a probability distribution p with
nonzero mass on several destination states, then the corresponding row of E will
be equal to $\gamma p$ except that 1 will be subtracted from the ith column. So, each
row of EA contains a difference between feature vectors along a transition: if
we observe a transition from state i to state j, and if state i has feature vector
$a_i$ and state j has feature vector $a_j$, then the corresponding row of EA will be
$\gamma a_j - a_i$. If we know the probability distribution p over possible destination
states then we can replace $\gamma a_j$ by its expectation under p.
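The rows of EA can be computed directly from the feature vectors without ever forming E. The sketch below assumes a hypothetical input format in which each transition is a pair of a starting state and a list of (destination, probability) pairs; the function name `edge_feature_rows` is likewise only illustrative.

```python
def edge_feature_rows(features, transitions, gamma):
    """Compute the rows of EA.

    features: features[i] is the feature vector a_i of state i.
    transitions: list of (i, successors), where successors is a list of
                 (j, probability) pairs for the destination distribution.
    Each returned row is gamma * E_p[a_j] - a_i.
    """
    rows = []
    for i, successors in transitions:
        a_i = features[i]
        expected_next = [0.0] * len(a_i)
        for j, prob in successors:
            for d, value in enumerate(features[j]):
                expected_next[d] += prob * value
        rows.append([gamma * e - a for e, a in zip(expected_next, a_i)])
    return rows


# Quadratic features (1, x, x^2) for the observed states x = 1 and x = 2.
features = [[1.0, 1.0, 1.0], [1.0, 2.0, 4.0]]
deterministic = edge_feature_rows(features, [(0, [(1, 1.0)])], gamma=0.5)
stochastic = edge_feature_rows(features, [(0, [(0, 0.5), (1, 0.5)])], gamma=1.0)
```

A single observed destination is the special case of a distribution with all of its mass on one state.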

The  matrix  B  plays  the  role  described  above:  we  restrict  the  minimizing  or 
flow  player  to  choose  a  vector  in  the  span  of  B.  Since  the  first  k  columns  of  each 
row  of  B  are  duplicated  from  A,  we  store  the  indices  into  A  instead  of  these 
columns; and since the remaining $m - n$ columns of B are sparse, we store these
columns  as  a  list  of  their  nonzero  entries. 

The vector c contains the cost of each transition. The vector $A^{\mathsf T} s$ is the
product of the transpose of our basis matrix A with the vector of starting frequencies s. We
can compute $-A^{\mathsf T} s$ as a weighted sum of the rows of EA, with the weight of
each  row  equal  to  the  number  of  times  we  have  traversed  the  corresponding 
transition. 

Once we have these inputs, the simplest way to find the coefficients of the
approximate value function is to construct the linear program

$$\text{minimize } c^{\mathsf T} f \text{ subject to } A^{\mathsf T} E^{\mathsf T} f + A^{\mathsf T} s = 0,\; f = Bg,\; f \ge 0 \qquad (4.15)$$

and  pass  it  to  a  prepackaged  linear  program  solver.  The  estimated  coefficients 
of  the  value  function  are  then  the  dual  variables  for  the  equality  constraints 
$A^{\mathsf T} E^{\mathsf T} f + A^{\mathsf T} s = 0$ (possibly negated, depending on how the prepackaged solver
defines  the  dual  variables).  This  approach  works  well  if  the  prepackaged  solver 
is  set  up  so  that  it  does  not  cause  too  much  fill-in  in  matrix  products  involving 
B. 

To  take  better  advantage  of  the  sparseness  in  B  we  have  implemented  an 
interior-point barrier method linear program solver customized for linear
programs of the type (4.15). Like other logarithmic barrier methods (for example
[AGMX96, Van94] and many others), our implementation approximately solves
a sequence of convex programs

$$\text{minimize } c^{\mathsf T} f - \mu \sum_i \ln f_i \text{ subject to } A^{\mathsf T} E^{\mathsf T} f + A^{\mathsf T} s = 0,\; f = Bg \qquad (4.16)$$


for decreasing values of the barrier parameter $\mu$. The barrier parameter serves a
similar purpose here to the one it served in Section 4.4: it softens the constraints
and makes the convex program smoother. Whereas in Section 4.4 we wanted to
smooth the constraints because of uncertainty in the coefficients of the linear
program, here we just want to smooth out the constraints to make the program
easier to solve. So, we will start with a large value of $\mu$, then try to track the
solution to (4.16) as we decrease $\mu$ towards zero.

The set of solutions to (4.16) for all values of $\mu$ is called the central path.
Figure 4.7 shows a fragment of the central path for the simple linear program

$$\text{minimize } x + y \text{ subject to } x \ge 0,\; y \ge 0,\; x + 2y \ge 1$$

As  the  figure  shows,  the  central  path  starts  out  far  from  any  of  the  constraint 
lines,  then  moves  smoothly  towards  the  optimal  solution. 
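The central path of this small program can be traced numerically. The sketch below is one way to do it and rests on assumptions not stated in the text: a logarithmic barrier is placed on all three inequality constraints, the starting point is (1, 1), the barrier parameter is halved at each stage, and each stage is re-centered by damped Newton steps with a backtracking line search. The function names are hypothetical.

```python
import math


def barrier_objective(x, y, mu):
    """x + y minus mu times the log barrier on x, y, and the slack x + 2y - 1."""
    slack = x + 2.0 * y - 1.0
    if x <= 0.0 or y <= 0.0 or slack <= 0.0:
        return math.inf
    return x + y - mu * (math.log(x) + math.log(y) + math.log(slack))


def center(x, y, mu, max_iter=50):
    """Damped Newton iterations toward the minimizer of barrier_objective."""
    for _ in range(max_iter):
        slack = x + 2.0 * y - 1.0
        gx = 1.0 - mu / x - mu / slack
        gy = 1.0 - mu / y - 2.0 * mu / slack
        hxx = mu / x ** 2 + mu / slack ** 2
        hxy = 2.0 * mu / slack ** 2
        hyy = mu / y ** 2 + 4.0 * mu / slack ** 2
        det = hxx * hyy - hxy * hxy
        dx = -(hyy * gx - hxy * gy) / det
        dy = -(hxx * gy - hxy * gx) / det
        decrement = -(gx * dx + gy * dy)  # g^T H^{-1} g, nonnegative
        if decrement <= 1e-12 * mu:
            break
        current = barrier_objective(x, y, mu)
        t = 1.0
        # Backtrack until the point is strictly feasible and decreases enough.
        while t > 1e-10 and (
            barrier_objective(x + t * dx, y + t * dy, mu)
            > current - 0.25 * t * decrement
        ):
            t *= 0.5
        if t <= 1e-10:
            break
        x, y = x + t * dx, y + t * dy
    return x, y


def central_path(mu_start=1.0, shrink=0.5, steps=20, start=(1.0, 1.0)):
    """Return a list of (mu, x, y) points for a decreasing sequence of mu."""
    x, y = start
    mu = mu_start
    path = []
    for _ in range(steps + 1):
        x, y = center(x, y, mu)
        path.append((mu, x, y))
        mu *= shrink
    return path


path = central_path()
```

For large barrier parameters the points stay well inside the feasible region; as the parameter shrinks they approach the optimal vertex of the linear program at x = 0, y = 1/2.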

The  Lagrangian  for  Equation  4.16  is 

$$c^{\mathsf T} f - \mu \sum_i \ln f_i + w^{\mathsf T} A^{\mathsf T} (E^{\mathsf T} f + s) + z^{\mathsf T} (f - Bg)$$

To find the saddle point of the Lagrangian we can set its derivatives with respect
to f, g, w, and z to zero. The resulting nonlinear equations are

$$\begin{aligned}
0 &= c - \mu f^{-1} + EAw + z \\
0 &= -B^{\mathsf T} z \\
0 &= A^{\mathsf T} E^{\mathsf T} f + A^{\mathsf T} s \\
0 &= f - Bg
\end{aligned}$$

where $f^{-1}$ means the vector whose components are the inverses of the components
of f. The only nonlinearity above is in the first equation. To make the math
more symmetric, we will declare a new m-vector x and replace the first equation
with

$$\begin{aligned}
x &= c + EAw + z \\
\mu &= f_i x_i \quad (\forall i)
\end{aligned}$$

Figure 4.7: The central path for a linear program.
If we have initial guesses for the variables f, g, w, z, and x, we can use
Newton’s method to find an update direction which brings them closer to solving
Equation 4.16. That is, we can linearize the equations around our current values
for the variables and solve the linearized equations for the update direction. We
will require that our initial guesses for f and x are strictly positive; the updates
we describe below will preserve this property.

We will linearize the equation $\mu = f_i x_i$ by replacing $f_i$ with $f_i + \Delta f_i$ and $x_i$
with $x_i + \Delta x_i$, then treating $f_i$ and $x_i$ as constants. The result is

$$\mu = f_i x_i + f_i \Delta x_i + x_i \Delta f_i + h_i$$

where $h_i$ is the remaining higher-order term that depends on both $\Delta x_i$ and
$\Delta f_i$. By using the shorthand that F and X stand for the diagonal matrices
with elements f and x, and that e stands for the vector of all ones, we can write
the linearized equations as

$$\mu e - h - Fx = F \Delta x + X \Delta f$$


The remaining equations are already linear, but to keep the notation consistent
we will replace g, w, and z by $g + \Delta g$, $w + \Delta w$, and $z + \Delta z$ and treat g, w,
and z as constants. Now we can collect all of the equations into one big array:

$$\begin{pmatrix}
0 & EA & I & -I & 0 \\
A^{\mathsf T} E^{\mathsf T} & 0 & 0 & 0 & 0 \\
I & 0 & 0 & 0 & -B \\
-I & 0 & 0 & -X^{-1}F & 0 \\
0 & 0 & -B^{\mathsf T} & 0 & 0
\end{pmatrix}
\begin{pmatrix} \Delta f \\ \Delta w \\ \Delta z \\ \Delta x \\ \Delta g \end{pmatrix}
= \cdots \qquad (4.17)$$


We  have  avoided  writing  the  constant  on  the  right  hand  side  because  it  would 
be  a  complicated  expression  and  its  exact  form  does  not  affect  the  following 
discussion.  The  matrix  in  (4.17)  is  symmetric  and  quasidefinite,  which  means 
that  we  can  factor  it  by  an  algorithm  similar  to  Cholesky  decomposition;  the 
only  difference  is  that  some  of  the  pivots  during  the  decomposition  will  be 
negative, so we will represent our factorization as $LDL^{\mathsf T}$ (where L is a lower
triangular matrix and D is a diagonal matrix) instead of incorporating $\sqrt{D}$ into
L.
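To make the role of the negative pivots concrete, here is a minimal dense $LDL^{\mathsf T}$ factorization without pivoting or sparsity handling. It is only a sketch for small matrices, not the factorizer used in our implementation; the function name `ldl_factor` and the 3 × 3 example are illustrative assumptions.

```python
def ldl_factor(M):
    """Factor a symmetric matrix as L D L^T with unit lower-triangular L.

    M is a list of rows. No pivoting is performed, so every pivot D[j]
    must be nonzero; this holds for symmetric quasidefinite matrices.
    """
    n = len(M)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = M[j][j] - sum(L[j][p] ** 2 * D[p] for p in range(j))
        for i in range(j + 1, n):
            partial = sum(L[i][p] * L[j][p] * D[p] for p in range(j))
            L[i][j] = (M[i][j] - partial) / D[j]
    return L, D


# Quasidefinite example: positive definite upper-left block, negative lower-right block.
M = [[2.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, -1.0]]
L, D = ldl_factor(M)
```

In this example the first two pivots are positive and the last is negative, which is why the square root of D cannot be folded into L as in an ordinary Cholesky factorization.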

Since  the  matrix  in  (4.17)  is  very  sparse  (many  of  its  blocks  are  identically 
zero,  others  are  diagonal,  and  the  matrix  B  is  sparse)  we  need  to  take  care  when 
factoring  it  not  to  introduce  too  much  fill-in.  So  we  will  factor  it  partway  by 
hand before giving it to our $LDL^{\mathsf T}$ factorizer. First we can use the fourth block
row  of  the  matrix  to  eliminate  the  fourth  block  column,  leaving  the  equations 


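$$\begin{pmatrix}
F^{-1}X & EA & I & 0 \\
A^{\mathsf T} E^{\mathsf T} & 0 & 0 & 0 \\
I & 0 & 0 & -B \\
0 & 0 & -B^{\mathsf T} & 0
\end{pmatrix}
\begin{pmatrix} \Delta f \\ \Delta w \\ \Delta z \\ \Delta g \end{pmatrix}
= \cdots$$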


This  step  causes  no  off-diagonal  fill-in.  Next  we  can  eliminate  the  first  block 
row  and  column,  leaving 


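$$\begin{pmatrix}
-A^{\mathsf T} E^{\mathsf T} X^{-1} F EA & -A^{\mathsf T} E^{\mathsf T} X^{-1} F & 0 \\
-X^{-1} F EA & -X^{-1} F & -B \\
0 & -B^{\mathsf T} & 0
\end{pmatrix}
\begin{pmatrix} \Delta w \\ \Delta z \\ \Delta g \end{pmatrix}
= \cdots$$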


This step causes some fill-in, but since EA is tall and narrow and $X^{-1}F$ is
diagonal  the  required  computation  is  not  large.  Next  we  will  eliminate  the 
second  block  row  and  column,  leaving 

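$$\begin{pmatrix}
0 & A^{\mathsf T} E^{\mathsf T} B \\
B^{\mathsf T} EA & B^{\mathsf T} F^{-1} X B
\end{pmatrix}
\begin{pmatrix} \Delta w \\ \Delta g \end{pmatrix}
= \cdots$$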

Since B is sparse, we have to worry about whether this step causes fill-in. The
matrix $B^{\mathsf T} EA$ is smaller than EA, so we don’t need to worry about fill-in in
this block. To analyze the block $B^{\mathsf T} F^{-1} X B$, first suppose that B has the form
given in Equation 4.14. Then if we divide $F^{-1}X$ into a three by three block
matrix with diagonal blocks $D_1$, $D_2$, and $D_3$, $B^{\mathsf T} F^{-1} X B$ is equal to

$$\begin{pmatrix}
A^{\mathsf T} (D_1 + D_2 + D_3) A & A^{\mathsf T} (D_1 - D_2) & A^{\mathsf T} (D_2 - D_3) \\
(D_1 - D_2) A & D_1 + D_2 & -D_2 \\
(D_2 - D_3) A & -D_2 & D_2 + D_3
\end{pmatrix}$$

More generally, we can divide B into k dense columns and $m - n$ sparse columns,
so $B^{\mathsf T} F^{-1} X B$ is of the form

$$\begin{pmatrix}
\text{small and dense} & (\text{narrow and dense})^{\mathsf T} \\
\text{narrow and dense} & \text{large and sparse}
\end{pmatrix}$$

The  dot  product  between  two  different  sparse  columns  is  nonzero  only  if  the 
two  columns  correspond  to  adjacent  transitions  from  the  same  state,  so  each 
column  of  the  large,  sparse  block  has  at  most  one  nonzero  element  above  the 
diagonal  and  at  most  one  below.  That  means  that  we  still  have  not  caused  any 
unacceptable  fill-in. 
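The elimination of the $\Delta z$ block row and column can be written out explicitly. In the derivation below, $r_1$, $r_2$, and $r_3$ are placeholder names (introduced here only for illustration) for the right-hand sides of the three block rows of the system in $\Delta w$, $\Delta z$, and $\Delta g$; their exact form does not matter for the structure of the result.

```latex
\begin{aligned}
&\text{Second block row: } -X^{-1}F\,EA\,\Delta w - X^{-1}F\,\Delta z - B\,\Delta g = r_2
  \;\Longrightarrow\;
  \Delta z = -EA\,\Delta w - F^{-1}X\,B\,\Delta g - F^{-1}X\,r_2 . \\[4pt]
&\text{First block row: }
  -A^{\mathsf T}E^{\mathsf T}X^{-1}F\,EA\,\Delta w
  - A^{\mathsf T}E^{\mathsf T}X^{-1}F\,\Delta z = r_1 \\
&\qquad\Longrightarrow\;
  0\cdot\Delta w + A^{\mathsf T}E^{\mathsf T}B\,\Delta g
  = r_1 - A^{\mathsf T}E^{\mathsf T} r_2 . \\[4pt]
&\text{Third block row: } -B^{\mathsf T}\Delta z = r_3
  \;\Longrightarrow\;
  B^{\mathsf T}EA\,\Delta w + B^{\mathsf T}F^{-1}X\,B\,\Delta g
  = r_3 - B^{\mathsf T}F^{-1}X\,r_2 .
\end{aligned}
```

The $\Delta w$ terms in the first block row cancel exactly because $X^{-1}F$ and $F^{-1}X$ are diagonal and inverse to each other, which is why the upper-left block of the reduced matrix is zero.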

Finally, for our last step before passing the matrix to the $LDL^{\mathsf T}$ factorizer,
we can pivot along the diagonal of the large sparse block of $B^{\mathsf T} F^{-1} X B$. The
result is a completely dense symmetric matrix with four $k \times k$ blocks. Since
k, the number of basis vectors, is small, we can factorize this matrix cheaply.
Using this factorization we can compute $\Delta w$ and the first k components of $\Delta g$;
then  we  can  substitute  backwards,  undoing  each  of  the  eliminations  described 
above,  to  compute  the  remaining  components  of  the  update  direction. 

Once  we  have  the  update  direction  vector,  telling  us  how  to  change  our 
estimates  of  f,  g,  w,  z,  and  x  to  move  closer  to  a  solution  of  Equation  4.16,  we 
need  to  decide  how  far  to  move  our  estimates  in  this  direction.  In  other  words, 
we need to compute a step length $\lambda \in [0, 1]$ which tells us what fraction of the
computed update vector to add to our estimates. Usually a step length of $\lambda = 1$
is too long, since it will cause some components of f or x to become zero or
negative. So, we compute the longest step $\lambda_0$ which keeps f and x positive;
then  we  use  a  step  length  which  is  the  smaller  of  1  and 

$$.666\,\lambda_0 + (\theta - .666)\,\lambda_0^2 \qquad (4.18)$$

The parameter $\theta \in (0, 1)$ controls how aggressively we try to approach the
boundary; we use $\theta = .99995$. The motivation for (4.18) is that the true solution
has f and x nonnegative, so the farther past the constraints $f \ge 0$ and $x \ge 0$
the update vector tries to take us, the less we should believe it. Equation 4.18
produces a conservative steplength near $.666\,\lambda_0$ if the update vector would drive f
or x far past their constraints, while it produces an aggressive steplength of $\theta$ if
the update direction vector brings f or x exactly to the border of the positive
orthant. 
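The step-length computation can be sketched as follows. The ratio test for the longest feasible step follows directly from the description above. The damping rule is written under an explicit assumption, namely that the second term of the rule is quadratic in the longest feasible step; this choice reproduces both limiting behaviours described in the text, but the exponent should be treated as an assumption rather than a confirmed detail. Function names are hypothetical.

```python
def longest_feasible_step(values, deltas):
    """Largest lam such that values + lam * deltas stays nonnegative.

    Returns infinity when no component is decreasing.
    """
    lam0 = float("inf")
    for v, d in zip(values, deltas):
        if d < 0.0:
            lam0 = min(lam0, -v / d)
    return lam0


def damped_step_length(lam0, theta=0.99995, base=0.666):
    """Step length: the smaller of 1 and the damped boundary step.

    Assumption: the second term is quadratic in lam0.
    """
    return min(1.0, base * lam0 + (theta - base) * lam0 ** 2)


# f and x are stacked into one list because both must stay positive.
current = [1.0, 2.0, 1.0, 1.0]
update = [-2.0, -1.0, 0.5, -0.25]
lam0 = longest_feasible_step(current, update)
lam = damped_step_length(lam0)
```

For any longest feasible step between 0 and 1 the damped step is at most θ times that step, so f and x remain strictly positive after the update.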

The foregoing discussion describes how to update f, g, w, z, and x if we
know the value of the barrier parameter $\mu$ and the higher-order term h. To
estimate $\mu$ and h we use a second-order predictor-corrector method. We start
by computing the update and step length for $\mu = 0$ and $h = 0$. (This update is
called the predictor step.) Then we estimate the higher order term by

$$h_i = \Delta f_i \, \Delta x_i$$

where $\Delta f$ and $\Delta x$ are the predictor updates to f and x. Next we compute a
target $\mu$ by a heuristic called Mehrotra’s rule. Mehrotra’s rule is based on the
observation that in the optimal solution to Equation 4.16 we have $f^{\mathsf T} x = m\mu$.
So, we can use $\mu_0 = f^{\mathsf T} x / m$ as an estimate of the barrier parameter $\mu$ that produced our
current values for f and x. We always want to try to lower $\mu$; to determine how
much to lower it we compute

$$\mu_1 = \frac{(f + \lambda \Delta f)^{\mathsf T} (x + \lambda \Delta x)}{m}$$

where $\lambda$ is the step length from the predictor step and $\Delta f$ and $\Delta x$ are the
predictor updates to f and x. The values $\mu_0$ and $\mu_1$ are the estimated barrier
parameters before and after the predictor step. Then we set the target barrier
to  be 


M  =  Mi 


This choice of target tries to lower $\mu$ somewhat more than the predictor step
alone would have. Finally we use these values for $\mu$ and h as our estimates of the
barrier  parameter  and  higher  order  term  to  compute  the  actual  update  vector 
and  step  length.  (The  change  between  the  predictor  step  and  the  actual  update 
vector  is  called  the  corrector  step.)  In  order  to  use  a  second-order  strategy  like 
this  one  we  have  to  solve  the  system  of  equations  (4.17)  twice  with  different 
right-hand  sides;  this  does  not  cause  too  much  extra  work  since  we  can  save  the 
factorization  from  the  first  time  and  reuse  it  the  second. 
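The quantities that the predictor step feeds into the corrector step can be computed as in the sketch below. It covers only the two barrier estimates and the higher-order term; the rule that combines the two estimates into the target barrier is deliberately not reproduced. The function name `predictor_statistics` is hypothetical.

```python
def predictor_statistics(f, x, df, dx, lam):
    """Barrier estimates before and after the predictor step, plus h.

    f, x: current positive iterates; df, dx: predictor updates;
    lam: predictor step length.
    """
    m = len(f)
    mu0 = sum(fi * xi for fi, xi in zip(f, x)) / m
    mu1 = sum(
        (fi + lam * dfi) * (xi + lam * dxi)
        for fi, xi, dfi, dxi in zip(f, x, df, dx)
    ) / m
    h = [dfi * dxi for dfi, dxi in zip(df, dx)]
    return mu0, mu1, h


mu0, mu1, h = predictor_statistics(
    f=[1.0, 1.0], x=[1.0, 1.0], df=[-1.0, -0.5], dx=[-0.5, -1.0], lam=0.5
)
```

A second estimate much smaller than the first indicates that the predictor direction makes good progress, so a more ambitious target barrier is reasonable.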

In  order  to  completely  specify  our  algorithm,  we  need  to  pick  initial  estimates 
for f, g, w, z, and x. We choose the very simple initialization $f_i = x_i = 1$ and
$g = 0$, $w = 0$, $z = 0$.


4.8  Experiments 

This  section  describes  three  experiments  with  the  algorithm  of  Section  4.7.  The 
first  experiment  is  very  simple  and  is  just  intended  as  a  sanity  check;  the  other 
two  are  with  larger  and  more  interesting  MDPs. 

4.8.1  Tiny  MDP 

The  MDP  for  this  experiment  consists  of  50  states  in  a  line.  The  actions  are  to 
go  one  state  left  or  right.  Moving  off  the  end  of  the  line  ends  the  process.  The 
cost  of  each  action  is  randomly  selected  before  the  beginning  of  the  experiment 
from  a  normal  distribution  with  mean  1  and  variance  .3,  and  remains  fixed  and 
deterministic  thereafter. 
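A reference value function for an MDP of this shape can be computed by value iteration. The sketch below makes several assumptions that the description above does not fix: costs are minimized, each state has its own left and right cost, leaving the line has value zero, and the discount factor `gamma` is a free parameter (with sampled costs that can be negative, a value below 1 or a check for negative cycles is needed for convergence). The function names are hypothetical.

```python
import random


def line_mdp_costs(n_states=50, mean=1.0, variance=0.3, seed=0):
    """Sample a fixed (left cost, right cost) pair for every state."""
    rng = random.Random(seed)
    sd = variance ** 0.5
    return [(rng.gauss(mean, sd), rng.gauss(mean, sd)) for _ in range(n_states)]


def exact_values(costs, gamma=1.0, tol=1e-12, max_sweeps=100000):
    """In-place value iteration for the line MDP, minimizing total cost."""
    n = len(costs)
    v = [0.0] * n
    for _ in range(max_sweeps):
        change = 0.0
        for i in range(n):
            left = costs[i][0] + gamma * (v[i - 1] if i > 0 else 0.0)
            right = costs[i][1] + gamma * (v[i + 1] if i < n - 1 else 0.0)
            new_value = min(left, right)
            change = max(change, abs(new_value - v[i]))
            v[i] = new_value
        if change <= tol:
            break
    return v


# Sanity check with unit costs: the value is the distance to the nearer end.
unit_values = exact_values([(1.0, 1.0)] * 50)
sampled_values = exact_values(line_mdp_costs(), gamma=0.99)
```

With unit costs and no discounting, the value of each state is the number of steps to the nearer end of the line, which gives a quick check of the iteration.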

Figure 4.8 shows the exact value function (large dots) and a quadratic
approximation to it (solid line). The quadratic approximation was computed by
the  algorithm  of  Section  4.7.  For  comparison,  Figure  4.8  also  shows  (dashed  line) 
the  least-squares  fit  to  the  exact  value  function.  As  we  expect,  the  least  squares 
fit  is  nearer  to  the  exact  value  function  than  the  solution  from  the  algorithm  of 
Section  4.7,  but  not  by  much. 


Figure  4.8:  Value  functions  for  MDP  with  50  states  in  a  line. 


4.8.2  Tetris 

The  game  of  Tetris,  shown  in  Figure  4.9,  is  played  on  a  board  10  squares  wide 
and  h  squares  tall  (we  used  h  =  16).  Each  square  of  the  board  is  either  empty 
or  full.  In  the  space  above  the  board  the  player  is  given  one  new  piece  at  a 
time.  Each  piece  consists  of  four  filled  squares  arranged  in  one  of  the  seven 
possible  tetrominos  (L,  backwards  L,  S,  Z,  T,  I,  and  square).  Depending  on 
which  type  of  piece  is  showing,  the  player  has  up  to  34  possible  actions:  each 
action  consists  of  placing  the  piece  in  a  particular  orientation  and  horizontal 
position  and  dropping  it.  The  edges  of  the  piece  are  not  allowed  to  extend 
beyond  the  left  or  right  boundaries  of  the  board.  Once  dropped,  the  piece  falls 
straight  downwards  until  its  path  is  blocked  by  a  filled  square,  at  which  point 
it  stops  moving  and  a  new  piece  appears  above  the  board.  If  the  piece  cannot 
move  downward  so  that  it  is  contained  entirely  within  the  board,  the  game  is 
over.  If  at  any  point  an  entire  row  of  the  board  is  filled  (that  is,  if  there  are  ten 
horizontally  adjacent  filled  squares)  then  that  row  disappears,  the  rows  above  it 
move  down,  and  a  new  empty  row  appears  at  the  top  of  the  board  to  keep  the 
height  constant.  The  player  scores  one  point  for  every  row  removed  this  way. 

Tetris  is  a  Markov  decision  process;  the  state  consists  of  the  arrangement 
of empty and filled squares ($2^{10h}$ possibilities) and the type of piece showing
(7 possibilities). The actions from each state are the possible positions and
orientations from which to drop the piece. The actions have stochastic outcomes:
while the motion of the piece and the scoring are deterministic, the type of the
next piece is chosen uniformly at random from the possible types. We chose a
discount factor of $\gamma = .99$.

The  human  version  of  Tetris  has  several  differences.  First,  there  are  more 
states  but  fewer  actions:  the  piece  is  shown  moving  down  the  board  one  row 
at  a  time,  with  enough  time  between  downward  motions  to  allow  for  several 
actions.  The  actions  are  to  move  the  piece  left  or  right  one  square,  to  turn  it 
90°  counterclockwise,  or  to  do  nothing.  Second,  the  human  version  has  h  =  20 
instead of h = 16. Finally, the scoring for the human version is more
complicated, containing bonuses for achievements such as placing pieces quickly or
removing several rows of filled squares at once. We chose the nonhuman version
of Tetris for several reasons: except for differences in h it is the same version
used in previous research [TV94, BI96]; it takes less computation per trial so
our experiments can run faster; the lower height causes lower scores which also
lets our experiments run faster; and it appears to be easier for the computer to
learn.

Figure 4.9: The game of Tetris.

We  chose  a  very  simple  representation  for  Tetris’s  value  function,  a  linear 
combination  of  just  five  features.  All  features  were  set  to  zero  for  game  over, 
thus  fixing  the  value  of  an  ended  game  at  zero.  For  a  game  in  progress,  the 
features  were: 

Constant  Always  equal  to  1. 

Average  height  The  average  height  of  the  highest  filled  block  in  each  of  the 
ten  columns. 

Maximum  height  The  maximum  height  of  any  filled  block. 

Airspace  The  total  number  of  empty  blocks  that  appear  anywhere  below  a 
filled  block  in  the  same  column. 

Bumpiness  The  sum  of  the  nine  absolute  differences  between  the  heights  of 
adjacent  columns. 

These  features  span  a  subspace  of  the  features  used  in  [BI96].  Although  this 
representation  is  simple,  it  contains  value  functions  whose  greedy  policies  are 
good Tetris players: the best learned players below scored hundreds of rows in
an  average  game. 
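The five features can be computed from a board as in the following sketch. The board encoding (a list of rows from top to bottom, with 1 for a filled square) and the function names are assumptions made for illustration; the width is taken from the board, so a 10-column board yields the nine height differences mentioned above.

```python
def column_heights(board):
    """Height of the highest filled square in each column (0 if empty).

    board: list of rows from top to bottom; each row is a list of 0/1 cells.
    """
    n_rows, n_cols = len(board), len(board[0])
    heights = []
    for c in range(n_cols):
        height = 0
        for r in range(n_rows):
            if board[r][c]:
                height = n_rows - r
                break
        heights.append(height)
    return heights


def tetris_features(board, game_over=False):
    """Return [constant, average height, maximum height, airspace, bumpiness]."""
    if game_over:
        return [0.0] * 5  # all features are zero for an ended game
    n_rows, n_cols = len(board), len(board[0])
    heights = column_heights(board)
    airspace = 0
    for c, height in enumerate(heights):
        # Count empty squares below the topmost filled square of the column.
        for r in range(n_rows - height, n_rows):
            if not board[r][c]:
                airspace += 1
    bumpiness = sum(abs(a - b) for a, b in zip(heights, heights[1:]))
    return [
        1.0,
        sum(heights) / n_cols,
        float(max(heights)),
        float(airspace),
        float(bumpiness),
    ]


# A small 4-row, 3-column board used only to exercise the definitions.
board = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
]
features = tetris_features(board)
```

The value estimate for a board is then the inner product of this feature vector with the learned weight vector.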

One  possible  source  of  confusion  about  this  representation  is  that  it  does 
not  encode  the  type  of  the  currently  falling  piece.  This  fact  does  not  prevent 
a  greedy  policy  from  taking  the  current  piece  into  account  when  it  chooses  an 
action:  since  the  greedy  policy  is  the  result  of  a  one-step  lookahead,  the  current 
piece  type  affects  the  choice  of  action  by  determining  the  set  of  possible  next 
states  for  the  lookahead. 

We compared the performance of two algorithms on this task. Both
algorithms used the representation described above for the value function. The
first algorithm was an approximate variant of policy iteration. We chose this
algorithm because we believe that most researchers would accept it as a
reasonable standard of comparison. In the approximate policy iteration algorithm,
we played groups of five games using the same policy. After each group we ran
the LS-TD(0) algorithm (described in Section 5.3.4) on that group’s training
examples  to  learn  an  approximate  value  function  for  the  corresponding  policy. 
After  learning  we  switched  to  a  new  policy  and  threw  away  all  previous  training 
examples.  To  determine  what  policy  to  follow,  we  kept  a  running  average  of  all 
of  the  value  functions  computed  so  far,  and  always  acted  greedily  with  respect 
to  that  average  value  function. 
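Because the value function is linear in its weights, averaging value functions amounts to averaging weight vectors. A minimal sketch of such a running average follows; the class name `AveragedValueFunction` and its methods are hypothetical and only illustrate the bookkeeping.

```python
class AveragedValueFunction:
    """Running average of the weight vectors learned after each group of games."""

    def __init__(self, n_features):
        self.count = 0
        self.weights = [0.0] * n_features

    def add(self, new_weights):
        """Fold one newly learned weight vector into the average."""
        self.count += 1
        self.weights = [
            avg + (w - avg) / self.count
            for avg, w in zip(self.weights, new_weights)
        ]

    def value(self, features):
        """Value of a state under the averaged weights."""
        return sum(w * phi for w, phi in zip(self.weights, features))


average = AveragedValueFunction(2)
average.add([1.0, 2.0])
average.add([3.0, 6.0])
```

The greedy policy then evaluates each candidate next state with `value` and picks the best action under the one-step lookahead.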

The  second  algorithm  was  the  one  described  in  Section  4.7.  To  make  the 
comparison  between  the  two  algorithms  as  easy  as  possible,  we  kept  as  many 
algorithmic  details  as  possible  the  same.  So,  we  played  groups  of  five  games 
using  the  same  policy,  we  threw  away  all  training  data  every  time  we  switched 
to  a  new  policy,  and  we  always  acted  greedily  according  to  the  average  of  all 
value functions computed so far. Instead of using LS-TD(0) to compute the
value  function  after  each  group  of  games,  though,  we  solved  a  linear  program  as 
described  in  Section  4.7.  Because  we  kept  so  many  algorithmic  details  the  same 
for  the  two  algorithms,  we  could  switch  between  them  by  changing  only  a  few 
lines  of  code. 

To  evaluate  the  performance  of  each  algorithm  we  simply  started  it  playing 
Tetris  and  recorded  its  scores.  Figure  4.10  shows  a  plot  of  each  algorithm’s  score 
as  a  function  of  how  many  groups  of  five  games  it  had  played.  The  plot  is  the 
average  of  five  runs  for  each  algorithm,  and  each  point  in  a  run  is  the  average 
of  the  scores  for  the  five  games  in  a  single  group.  This  type  of  plot  tends  to 
accentuate  differences  between  algorithms,  since  better  algorithms  will  achieve 
longer  games  sooner  and  so  will  have  access  to  more  training  data. 

As  Figure  4.10  shows,  the  linear  programming  algorithm  manages  to  learn 
a  decent  Tetris  player,  but  it  does  not  achieve  the  performance  of  approximate 
policy  iteration.  Section  4.8.3  explores  some  possible  reasons  for  this  behavior. 

We  examined  the  weight  vectors  learned  by  the  two  algorithms,  and  they 
were  substantially  different.  To  check  whether  the  difference  might  have  been 
caused by slow convergence or local optima, we started the linear programming
algorithm from the weight vector learned by approximate policy iteration.
Within a few groups of games, the linear programming algorithm had moved
away from its starting vector and back towards the answer it had converged to
from the original starting point.

Figure 4.10: Performance of two algorithms for playing Tetris. Heavy line:
linear programming. Light line: policy iteration.

4.8.3  Hill-car 

We  hypothesized  that  the  linear  programming  algorithm’s  difficulty  in  learning 
Tetris  was  caused  by  trying  to  reason  about  transitions  that  led  to  states  we  had 
never  visited.  Since  the  learner  has  no  direct  constraints  on  the  values  of  these 
states,  and  since  their  representations  may  be  outside  the  convex  hull  of  the 
representations  of  the  visited  states,  we  thought  that  trying  to  infer  the  values 
of  these  states  might  cause  instability.  Unfortunately,  in  a  typical  application, 
there  is  no  way  to  avoid  reasoning  about  unvisited  states:  the  learner  simply 
does  not  have  time  to  explore  every  transition,  so  if  we  discard  transitions  that 
we  have  not  followed,  we  will  be  reduced  to  a  single  transition  out  of  most  states. 

In  a  small  MDP,  though,  it  is  possible  to  visit  every  state.  So  to  verify  our 
hypothesis,  we  performed  an  experiment  on  a  much  simpler  MDP.  We  expected 
that,  if  the  unvisited  states  were  causing  our  problems,  the  learning  performance 
would  start  out  poor  (worse  than  could  be  explained  just  by  lack  of  data),  then 
improve  rapidly  as  our  sample  size  increased,  and  finally  become  acceptable 
once  we  had  visited  most  or  all  of  the  states  in  the  state  space. 

For this experiment we took the hill-car problem from Section 2.5.2, changed
the time increment to .1s, and reduced the state space to [-1, .7] × [-2, 2]
(corresponding to position × velocity). Then we discretized the state space to
a 20 × 20 grid using bilinear interpolation. The result is a 400-state, 800-edge
discrete MDP. Each row of the edge matrix for this MDP has up to five nonzero
entries: one negative entry for the state at time t, and up to four positive entries
for the possible states at time t + 1.

We  collected  data  by  following  a  fixed  policy:  always  thrust  right.  Since  the 
goal  is  to  get  to  any  position  greater  than  .6,  this  policy  is  optimal  from  any 
state  where  the  car  has  sufficient  momentum  to  reach  the  top  of  the  hill,  but 
will never terminate if the car does not start out with enough momentum. To
avoid infinite trajectories, we terminated a trajectory with probability .01 on
each time step, corresponding to a discount factor of $\gamma = .99$.

Figure 4.11: The exact value function for the hill-car MDP and a spline
approximation to it.

We  represented  the  value  function  by  storing  the  estimated  values  at  49 
states on a 7 × 7 grid, then interpolating in each direction with a cubic spline.
In  other  words,  we  used  49  basis  functions,  each  of  which  was  the  product  of  a 
cubic  spline  depending  only  on  position  with  a  cubic  spline  depending  only  on 
velocity. 

In  each  run  of  our  experiment  we  collected  seven  trajectories  at  a  time, 
then  fed  all  of  the  trajectories  so  far  to  our  learning  algorithm.  During  each 
trajectory we recorded all available actions from each state we visited. We never
changed  policies,  and  we  never  threw  away  any  data.  After  each  invocation  of 
the  learning  algorithm,  we  recorded  both  whether  it  converged  and  what  weight 
vector  it  converged  to.  We  collected  twenty  groups  of  trajectories,  for  a  total  of 
140  trajectories  per  run. 

We  ran  the  experiment  five  times.  Each  time,  after  we  had  collected  all  140 
trajectories,  the  learning  algorithm  was  able  to  find  a  good  approximation  to 
the  true  value  function.  Figure  4.11  shows  the  true  value  function  and  a  typical 
approximation  to  it. 

In  three  of  the  five  runs,  though,  the  linear  programming  algorithm  did 
not  converge  within  75  iterations  of  the  interior-point  method  when  given  data 
from  only  the  first  group  of  seven  trajectories.  In  one  of  these  three,  it  also  did 
not converge when given data from the first two groups. In fact, on all runs,
the linear program shows signs of ill-conditioning when data are scarce, either
by lack of convergence or by convergence to an answer with a large 2-norm.
Figure 4.12 shows a typical example of the latter. The value function shown
in the figure is based on data from three groups of trajectories; notice that
the estimated values are off of the plot scale at two corners of the state space,
(.6, -2) and (-1, 2), where the data are particularly sparse. The full range of
this learned value function is [-52.23, 15.22].

On  the  other  hand,  all  runs  converged  consistently  to  value  functions  with 
about the right two-norm after they had seen at least fifteen groups of seven
trajectories. We believe that this behavior supports our hypothesis that transitions
ending in unvisited states tend to cause instability.

Figure 4.12: A value function learned from sparse data.


4.9  Discussion 

In this chapter we have examined the connections among Markov decision
processes, linear and convex programming, and maximum likelihood. Based on
our analysis we have recommended a method for designing value-function
approximating algorithms: substitute an approximate representation for the value
function into the Bellman linear program, then add a penalty term to the dual
of the Bellman program. We have coded a fast implementation of one such
algorithm, and experimented with this implementation. While the learning
performance of this algorithm does not improve on the best prior algorithms,
we hope that the intuition and design methodology of this chapter can aid in
the design of other algorithms for solving MDPs.

Chapter 5

RELATED WORK

This chapter is a brief summary of related reading on Markov decision
processes. It starts by considering methods for solving small MDPs exactly, such
as  value  iteration,  policy  iteration,  and  linear  programming.  Next  it  discusses 
exact  methods  for  solving  special  cases  of  MDPs  like  linear-quadratic-Gaussian 
processes  and  continuous-time  problems  that  are  linear  in  their  controls.  Then 
it  considers  a  variety  of  methods  for  approximating  value  functions.  These 
methods  range  from  simple  interpolation  on  a  regular  grid  to  neural  networks 
trained  by  gradient  descent.  Finally  it  describes  incremental  algorithms  for 
solving  MDPs. 


5.1  Discrete  problems 

One of the first algorithms for solving Markov decision processes was the
Bellman-Ford single-destination shortest paths algorithm [Bel58, FF62], which learns
paths in a graph (i.e., a deterministic undiscounted MDP) by repeatedly
updating the estimated distance-to-goal for each node based on the distances for
its neighbors. The Bellman-Ford algorithm is a special case of value iteration,
which  is  defined  in  Chapter  1.  For  other  early  work  on  similar  algorithms 
see  [Bel61,  Bla65]. 
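
As a rough illustration of the update just described, the following Python sketch runs Bellman-Ford toward a single goal node. The edge-list format, the function name `distances_to_goal`, and the small example graph are assumptions made for this sketch; they are not taken from the cited papers.

```python
def distances_to_goal(edges, num_nodes, goal):
    """Bellman-Ford toward one goal node.

    edges is a list of (u, v, cost) triples meaning "from u we can step to v
    at the given cost". Each sweep re-estimates the distance-to-goal of u
    from the current estimate for its neighbor v.
    """
    inf = float("inf")
    dist = [inf] * num_nodes
    dist[goal] = 0.0
    # With nonnegative costs, num_nodes - 1 sweeps are always enough.
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, cost in edges:
            if dist[v] + cost < dist[u]:
                dist[u] = dist[v] + cost
                changed = True
        if not changed:
            break
    return dist


# Hypothetical four-node graph; node 3 is the goal.
EDGES = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 5.0), (2, 3, 1.0)]
DIST = distances_to_goal(EDGES, 4, goal=3)
```

Here the estimates are updated in place during a sweep, which is one of several valid update orders for value iteration.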

Besides value iteration, another good way to solve small MDPs is policy
iteration. Policy iteration maintains a current policy $\pi^{(i)}$ on each step $i$. It
solves the equation

$$v = T_{\pi^{(i)}} v$$

on each step, setting $v^{(i)}$ to be the solution, and then computes $\pi^{(i+1)}$ to be
the greedy policy for $v^{(i)}$. Policy iteration often takes many fewer steps to
converge than value iteration, but each step requires more work. For a proof of
the convergence of policy iteration see [BT89].
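
A minimal Python sketch of policy iteration on a tiny MDP may make the two alternating steps concrete. Everything specific here is an assumption made for illustration: the cost-minimizing, discounted formulation, the array layout `P[action][state][next_state]`, the three-state example, and the helper names.

```python
def solve_linear(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def evaluate_policy(P, c, gamma, policy):
    """Solve v = T_pi v exactly, i.e. (I - gamma * P_pi) v = c_pi."""
    n = len(policy)
    M = [[(1.0 if i == j else 0.0) - gamma * P[policy[i]][i][j]
          for j in range(n)] for i in range(n)]
    rhs = [c[policy[i]][i] for i in range(n)]
    return solve_linear(M, rhs)


def greedy_policy(P, c, gamma, v):
    """Pick, in each state, the action with the smallest backed-up cost."""
    def q(a, i):
        return c[a][i] + gamma * sum(p * vj for p, vj in zip(P[a][i], v))
    return [min(range(len(P)), key=lambda a: q(a, i)) for i in range(len(v))]


def policy_iteration(P, c, gamma, policy):
    """Alternate exact evaluation and greedy improvement until no change."""
    while True:
        v = evaluate_policy(P, c, gamma, policy)
        new_policy = greedy_policy(P, c, gamma, v)
        if new_policy == policy:
            return policy, v
        policy = new_policy


# Hypothetical MDP: states 0 and 1 lead toward an absorbing, cost-free state 2.
P = [
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],  # action 0: "slow"
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 1: "fast"
]
c = [[1.0, 1.0, 0.0], [1.5, 1.5, 0.0]]  # c[action][state]
POLICY, VALUES = policy_iteration(P, c, 0.9, [0, 0, 0])
```

On this example the loop stops after two evaluations: the all-"slow" starting policy is replaced by "fast" in states 0 and 1, and the next greedy step changes nothing.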

Midway between value iteration and policy iteration lies modified policy
iteration. In MPI we store both a current value function $v^{(i)}$ and a current
policy $\pi^{(i)}$. On each step we compute the next value function $v^{(i+1)}$ from
$v^{(i)}$ by the backup operator for the current policy, $v^{(i+1)} = T_{\pi^{(i)}} v^{(i)}$. On some steps
we set $\pi^{(i+1)}$ to be the greedy policy for $v^{(i+1)}$, as value iteration does, but
on other steps we just keep $\pi^{(i+1)} = \pi^{(i)}$. The relative frequency of these two
types of step is a parameter of the algorithm: if we always choose the greedy
policy, MPI reduces to value iteration, while if we usually keep the policy from
the previous step, MPI behaves more like policy iteration.
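
The schedule of backups and policy updates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: costs are minimized, the problem is discounted, the policy is refreshed every `backups_per_improvement` steps (a hypothetical parameter name), and the three-state MDP is invented for the example.

```python
def backup(P, c, gamma, policy, v):
    """Apply T_pi once: v(i) <- c_pi(i) + gamma * sum_j P_pi(i, j) v(j)."""
    return [c[policy[i]][i]
            + gamma * sum(p * vj for p, vj in zip(P[policy[i]][i], v))
            for i in range(len(v))]


def greedy_policy(P, c, gamma, v):
    """Pick, in each state, the action with the smallest backed-up cost."""
    def q(a, i):
        return c[a][i] + gamma * sum(p * vj for p, vj in zip(P[a][i], v))
    return [min(range(len(P)), key=lambda a: q(a, i)) for i in range(len(v))]


def modified_policy_iteration(P, c, gamma, backups_per_improvement, steps):
    """Back up under the current policy on every step; switch to the greedy
    policy only on every backups_per_improvement-th step."""
    n = len(P[0])
    v = [0.0] * n
    policy = [0] * n
    for step in range(steps):
        v = backup(P, c, gamma, policy, v)
        if (step + 1) % backups_per_improvement == 0:
            policy = greedy_policy(P, c, gamma, v)
    return policy, v


# Hypothetical MDP: states 0 and 1 lead toward an absorbing, cost-free state 2.
P = [
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],  # action 0: "slow"
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 1: "fast"
]
c = [[1.0, 1.0, 0.0], [1.5, 1.5, 0.0]]  # c[action][state]
MPI_POLICY, MPI_VALUES = modified_policy_iteration(P, c, 0.9, 5, 100)
```

Setting `backups_per_improvement` to 1 gives the value-iteration end of the spectrum; larger settings spend more backups evaluating each policy before changing it.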

Even  more  generally,  we  could  store  separately  a  value  and  an  action  for 
each  state,  and  on  each  step  improve  some  of  the  values  (by  setting  them  to 
the  result  of  the  value  backup  operator  for  the  current  policy)  and  some  of  the 
actions  (by  setting  them  to  the  action  for  the  greedy  policy).  By  choosing  an 
order  of  updates  we  can  produce  the  value  iteration  algorithm,  the  modified 
policy  iteration  algorithm,  and  other  algorithms  in  between.  As  long  as  we 
update  all  actions  and  values  often  enough,  the  resulting  algorithm  converges 
(see [BT89]).

Finally, we can solve small MDPs by converting them to linear programs as
described  in  Chapter  4,  then  solving  the  linear  programs  with  simplex,  barrier 
methods,  or  other  linear  programming  algorithms  [Ber76,  p248]  [Ros83,  p40]. 
An  MDP  which  takes  a  long  time  to  solve  by  value  iteration  can  sometimes  take 
much  less  time  to  solve  by  linear  programming,  and  vice  versa.  See  [TZ93,  TZ95, 
TZ97]  for  some  comparisons  between  linear  programming  and  value  iteration. 

5.2  Continuous  problems 

In  the  previous  section  we  described  several  ways  to  find  the  exact  value  function 
for  a  sufficiently  small,  discrete  MDP.  None  of  the  methods  of  the  previous 
section  is  appropriate  for  solving  MDPs  with  continuous  state  spaces.  To  solve 
such an MDP, we must turn either to special cases or to approximate methods.

Approximate methods for solving continuous MDPs are similar to approximate
methods for solving large, discrete MDPs, so we will put off discussing
them  until  Section  5.3.  The  rest  of  this  section  describes  some  special  cases  of 
MDPs  with  continuous  state  spaces  that  we  know  how  to  solve  exactly. 

5.2.1  Linear-Quadratic-Gaussian  MDPs 

One well-studied special case of continuous Markov decision processes is the
linear-quadratic-Gaussian problem, where the transition function is linear in
the states and controls, the cost function is quadratic, and all noise is additive
Gaussian. The value function for an LQG problem is always quadratic, with
coefficients given by a set of equations called the Riccati equations; so,
we can solve even high-dimensional LQG problems easily. (In fact, hidden state
makes LQG problems only slightly more difficult.)
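
To make the quadratic value function concrete, here is a small Python sketch for a one-dimensional case. It assumes a discrete-time, undiscounted problem with dynamics x' = a x + b u (plus additive noise, which changes the value only by a constant and not the gain), per-step cost q x^2 + r u^2, and value function v(x) = p x^2; the function name and the numerical example are illustrative only.

```python
def scalar_riccati(a, b, q, r, iterations=200):
    """Iterate the scalar Riccati recursion to a fixed point.

    Returns (p, gain): the value function is v(x) = p * x**2 and the
    corresponding control is u = -gain * x.
    """
    p = 0.0
    for _ in range(iterations):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    gain = a * b * p / (r + b * b * p)
    return p, gain


# With a = b = q = r = 1 the fixed point solves p**2 - p - 1 = 0.
P_COEF, GAIN = scalar_riccati(1.0, 1.0, 1.0, 1.0)
```

In higher dimensions p becomes a symmetric matrix and the same recursion is written with matrix products and a matrix inverse.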

Even  if  a  problem  does  not  appear  linear  at  first  glance,  it  is  sometimes 
possible  to  make  it  linear  by  a  transformation  of  the  state  and  control  variables. 
Problems  which  may  be  so  transformed  are  called  feedback  linearizable  (see 
[SL91]  for  more  detail).  One  important  example  of  a  feedback-linearizable  model 
is  an  idealized  multi-link  robot  arm;  for  this  model,  feedback  linearization  is 
often  called  the  “method  of  computed  torques.”  Of  course,  if  the  original  model 
contains  errors  (for  example,  friction  or  backlash  in  the  robot  arm),  so  will  the 
linearized  model.  In  fact,  the  errors  in  the  linearized  model  can  be  worse,  since 
the  computed  control  input  may  need  to  be  very  large  to  cancel  the  original 
model’s  nonlinearities.  Another  possible  source  of  problems  is  that  quadratic 
costs  and  Gaussian  errors  may  no  longer  be  quadratic  and  Gaussian  after  the 
transformation. 


5.2.2  Continuous  time 

Many  MDPs  with  continuous  state  spaces  evolve  in  continuous  time  rather  than 
in discrete steps. For such an MDP it is natural to write the value function as the
solution to a differential equation. To do so, we must make some assumptions,
the  most  important  of  which  is  that  the  MDP  is  deterministic. 

If the state $x(t)$ of an MDP evolves according to

$$\frac{dx}{dt} = f(x, u)$$

where $u(t)$ is the control input, and if the cost of a path $x(t)$ under control $u(t)$
is $\int c(x(t), u(t))\,dt$, then the value function $v$ satisfies the differential equation

$$\min_u \left[ \frac{dv}{dx} \cdot f(x, u) + c(x, u) \right] = 0 \qquad (5.1)$$


Equation 5.1 is called the steady-state Hamilton-Jacobi-Bellman or HJB
equation [Ber95]. To ensure that the HJB equation has a unique solution, we must
specify  sufficiently  many  boundary  conditions. 

For  some  MDPs  we  may  be  able  to  solve  the  HJB  equations  analytically.  It 
is  easiest  to  solve  the  HJB  equations  analytically  for  Markov  processes:  since 
Markov  processes  allow  only  one  choice  of  control  u,  the  minimization  over  u  is 
unnecessary  and  so  the  HJB  equations  are  linear. 
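
As a small worked example constructed here for illustration (it does not come from the cited references), take the one-dimensional Markov process $\dot{x} = -x$ with cost rate $c(x) = x^2$ and boundary condition $v(0) = 0$. With no minimization, the HJB equation is linear in $v$ and can be integrated directly; the last line checks the answer against the cost accumulated along the trajectory $x(t) = x_0 e^{-t}$.

```latex
\begin{align*}
\frac{dv}{dx}\,(-x) + x^2 &= 0
  && \text{(HJB equation with the minimization dropped)} \\
\frac{dv}{dx} &= x
  && \text{(for } x \neq 0\text{)} \\
v(x) &= \tfrac{1}{2} x^2
  && \text{(using } v(0) = 0\text{)} \\
\int_0^\infty \left( x_0 e^{-t} \right)^2 dt &= \tfrac{1}{2} x_0^2
  && \text{(direct check of the path cost)}
\end{align*}
```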

The paper [MM98] describes how to use a subset of averagers called barycentric
interpolators to solve continuous-time Markov decision processes. The
essential feature is that the authors add a requirement to the averager which
ensures that, as the representational power of the averager grows, the fixed
point of fitted value iteration converges to the true value function.

The  following  section  describes  a  different  approach  to  finding  the  best  con¬ 
trol  for  a  continuous-time  MDP. 


5.2.3  Linearity  in  controls 

Consider the single-input, single-output, $n$th order system

$$x^{(n)} = a(\mathbf{x}) + b(\mathbf{x})\,u$$

where $\mathbf{x}$ is a vector whose components are $x$ and its time derivatives up to order
$n - 1$, $a$ and $b$ are (possibly nonlinear) functions of $\mathbf{x}$, and $b(\mathbf{x})$ is bounded away
from zero. (For a generalization of the contents of this section to systems with
$k$ inputs and $k$ independent outputs, see for example [SL91].)

Our goal in this section will be to supply an input $u(t)$ so that the output
$x(t)$ tracks a given reference signal $x_d(t)$ as closely as possible. This goal is
less  general  than  controlling  an  arbitrary  MDP  in  four  important  ways:  first, 
we  have  replaced  a  general  cost  function  by  the  simpler  objective  of  tracking 
a  known  reference  signal.  Second,  we  have  assumed  that  the  system  to  be 
controlled  is  deterministic.  Third,  we  have  assumed  that  the  system  is  linear 
in  the  control.  Fourth,  we  have  assumed  that  the  number  of  control  inputs 
is  equal  to  the  number  of  independent  outputs.  The  last  two  assumptions  in 
particular are often unrealistic, since they allow us to cancel an arbitrary drift by
choosing a sufficiently large control input. For example, the third assumption
is  violated  by  a  robot  whose  actuators  can  only  exert  a  bounded  amount  of 
force  (but  see  [YSS97]  for  a  treatment  of  linear  systems  with  bounded  controls); 
the  fourth  is  violated  by  the  well-known  cart-pole  problem,  which  is  to  control 
the  angle  of  a  pole  and  the  position  of  its  base  (two  independent  outputs)  by 
exerting  a  horizontal  force  at  the  base  (one  input). 

One  benefit  of  making  these  simplifying  assumptions  is  that  we  will  be  able 
to derive a controller which succeeds even if we replace $a$ and $b$ by estimates $\hat{a}$
and $\hat{b}$ with bounded error. (Such a controller is called robust.) These estimates
might  come,  for  example,  from  a  supervised  learner  trained  on  observed  system 
trajectories. 

Before we consider robust control, we will derive a controller for use when
we know $a$ and $b$ exactly. To that end, write $e = x - x_d$ for our tracking error,
and let

$$s = \left( \frac{d}{dt} + \omega \right)^{n-1} e$$

where $\omega$ is a positive constant. The combined error measure $s$ is a linear
combination of our tracking error and its derivatives; its importance is that, if we
manage to achieve $s = 0$, the tracking error $e$ must converge exponentially to zero.
To see why, consider the solutions of the differential equation $(\frac{d}{dt} + \omega)^{n-1} e = 0$.
The polynomial $(x + \omega)^{n-1}$ has all of its roots at $-\omega$. So, the norm of any
solution $(e, \dot{e}, \ldots, e^{(n-2)})$ must behave like $\exp(-\omega t)$. Since $\omega > 0$, this means
that the solutions all decay to zero with time constant $1/\omega$.

If we define $r$ so that $s = e^{(n-1)} + r$, we can solve for $u$ in terms of $\dot{s}$ and
known quantities:

$$u = \frac{1}{b(\mathbf{x})} \left[ \dot{s} - a(\mathbf{x}) + x_d^{(n)} - \dot{r} \right]$$

If we start out with the combined error measure $s$ at zero, we can find the $u$
which maintains $s = 0$ by setting $\dot{s} = 0$ in the above equation. More generally,
if $s \neq 0$, we can use a simple PD controller to reduce $s$ by setting $\dot{s} = -ks$ for
some positive constant $k$. The resulting $u$ will cause $s$ to decay exponentially
to zero.

For example, suppose we want to control the system $\ddot{x} = \cos(\exp x) + u$ to track
$\sin t$ starting from $x = \dot{x} = 0$. If we choose $\omega = 1$, the combined error measure
is $s = \dot{e} + e = (\dot{x} - \cos t) + (x - \sin t)$, and the recommended control input is
$-\sin t - (\dot{x} - \cos t) - ks - \cos(\exp x)$. If we choose $k = 1$, we get

$$u = 2\cos t - 2\dot{x} - x - \cos(\exp x)$$

Figure 5.1: Tracking performance. The two curves are $x(t)$ and $x_d(t) = \sin t$.

Figure 5.1 shows the resulting tracking performance. Increasing either $k$ or $\omega$
would cause faster tracking convergence at the cost of increased control activity.
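
The example can be reproduced approximately with a short simulation. The Python sketch below assumes the system is the second-order one with unit control coefficient (acceleration equal to cos(exp x) + u), uses forward-Euler integration with a hypothetical step size, and reports the tracking error at the end of the run; none of these numerical choices come from the original experiment.

```python
import math


def simulate_tracking(k=1.0, omega=1.0, t_end=10.0, dt=1e-3):
    """Track x_d(t) = sin t with an exactly known model.

    Returns the final tracking error e = x - x_d.
    """
    x, xdot, t = 0.0, 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        e = x - math.sin(t)
        edot = xdot - math.cos(t)
        s = edot + omega * e              # combined error measure for n = 2
        a = math.cos(math.exp(x))         # known drift term, with b = 1
        # Choose u so that sdot = -k s:
        # u = sdot_target - a + x_d'' - rdot, where r = omega * e.
        u = -k * s - a - math.sin(t) - omega * edot
        xddot = a + u
        x += dt * xdot                    # forward-Euler step
        xdot += dt * xddot
        t += dt
    return x - math.sin(t)


FINAL_ERROR = simulate_tracking()
```

With k = 1 and omega = 1 the closed loop is linear, and the exact error is e(t) = -t exp(-t), so the simulated error should be close to zero by t = 10.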

The recommended control $u$ depends on $x_d$ and its derivatives as well as $x$
and its derivatives. We assume that $x$ and its derivatives can be either directly
observed or computed. If the derivatives of $x_d$ are not available (as might be
the case for example if the desired trajectory were specified by a user with a
joystick), a possible fix is model-reference control. In model-reference control,
the object is to track not $x_d$ but a filtered version of $x_d$. The filter is called a
reference model, and its purpose is twofold: first to ensure that sufficiently many
derivatives of the filtered $x_d$ are available, and second to ensure that the filtered
$x_d$ is smooth enough that it can be tracked without unduly large control inputs.
One common choice of reference model is a low-pass filter. More generally, the
reference model might take some input other than $x_d$ (for example derivatives
of $x_d$) to produce the filtered $x_d$.

Now suppose that, instead of the exact model $a$ and $b$, we have estimates $\hat{a}$ and $\hat{b}$,
with $-\alpha < \hat{a} - a < \alpha$ and $\beta^{-1} < \hat{b}/b < \beta$. (The uncertainty bounds $\alpha > 0$
and $\beta > 1$ might in general depend on $\mathbf{x}$ and $t$.) Notice that this model of
uncertainty assumes that the sign of $b$ is known, which might not be a plausible
assumption in some domains.

With an uncertain model, the PD controller $\dot{s} = -ks$ may no longer work.
So, we will use instead the bang-bang control law $\dot{s} = -k\,\mathrm{sgn}(s)$. The resulting
choice of $u$ is called a sliding mode control. If we choose $k$ large enough, we can
guarantee that the sliding mode controller will cause $s$ to converge to zero even
without knowing the exact $a$ and $b$. (With a small amount of algebra, we can
show that $k = \beta(\alpha + \eta) + (\beta - 1)|u_0|$ is large enough, where $\eta$ is a small positive
number and $u_0$ is the control that would result from setting $\dot{s}$ to zero.) Better
approximations for $a$ and $b$ will allow us to reduce $k$ and use a smaller bang-bang
term.

Because of the bang-bang term $-k\,\mathrm{sgn}(s)$, the sliding mode control is
discontinuous in $\mathbf{x}$ across the surface $s = 0$. In fact, once the state hits $s = 0$, the
recommended control $u(t)$ will generally have infinitely many discontinuities in
any finite-length time interval. Such a control is usually physically impossible
to implement; so, in practice, one would generally interpolate $u(\mathbf{x})$ across a thin
boundary layer $-\epsilon < s < \epsilon$.
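
A simplified simulation can illustrate the boundary-layer idea. The Python sketch below makes several assumptions that are not from the text: the control coefficient is known exactly (so the gain reduces to alpha + eta), the drift cos(exp x) is replaced by the crude estimate 0 with error bound alpha = 1, the sign function is replaced by a saturation of s / eps, and the plant is integrated with forward Euler. All parameter values are hypothetical.

```python
import math


def sat(z):
    """Saturation: linear inside [-1, 1], equal to sgn(z) outside."""
    return max(-1.0, min(1.0, z))


def simulate_sliding_mode(alpha=1.0, eta=0.1, eps=0.05, omega=1.0,
                          t_end=30.0, dt=1e-3):
    """Track x_d(t) = sin t using only a bound on the drift term.

    Returns (final tracking error, last value of s).
    """
    k = alpha + eta        # switching gain when the control coefficient is exact
    a_hat = 0.0            # estimate of cos(exp x); its error is at most alpha
    x, xdot, t = 0.0, 0.0, 0.0
    s = 0.0
    for _ in range(int(round(t_end / dt))):
        e = x - math.sin(t)
        edot = xdot - math.cos(t)
        s = edot + omega * e
        # Same form as the exact controller, with a_hat in place of a and
        # the switching term interpolated across the layer |s| < eps.
        u = -k * sat(s / eps) - a_hat - math.sin(t) - omega * edot
        xddot = math.cos(math.exp(x)) + u   # true plant, unknown to controller
        x += dt * xdot
        xdot += dt * xddot
        t += dt
    return x - math.sin(t), s


SM_ERROR, SM_S = simulate_sliding_mode()
```

Outside the layer the switching term dominates the model error, so s is driven toward the layer; inside it the controller behaves like a high-gain linear feedback on s, which trades exact convergence for a smooth control signal.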


5.3  Approximation 

All of the above methods are designed to find exact solutions to Markov
decision processes. Because of this fact, they are usually limited to solving small or
special-case MDPs. On the other hand, it is perfectly possible to run similar
algorithms on an approximate representation of the solution to a decision
problem. For example, Bellman discusses finding approximate value functions by
quantization and low-order polynomial interpolation in [Bel61], and
decomposition by orthogonal functions in [BD59, BKK63]. These approximate methods
are not covered by the convergence proofs for the exact methods. But, if they
do converge, they can allow us to find numerical solutions to problems which
would otherwise be too large to solve.

Researchers have experimented with a number of approximate algorithms for
finding value functions. Results have been mixed: there have been notable
successes, including Samuel's checkers player [Sam59] and Tesauro's backgammon
player [Tes90]. But these algorithms are notoriously unstable; Boyan and Moore
list several embarrassingly simple situations where popular algorithms fail
miserably [BM95]. Some possible reasons for these failures are given in [TS93, Sab93].

The  remainder  of  this  section  discusses  approximate  algorithms  for  solving 
MDPs.  Many  of  these  algorithms  are  modifications  of  the  exact  algorithms 
described  in  Section  5.1. 


5.3.1  State  aggregation 

The  most  straightforward  way  to  approximate  a  continuous  MDP,  and  one  of 
the  best-known,  is  to  discretize  the  state  space  into  a  grid  and  assign  the  same 
value  to  every  state  in  a  given  cell.  Similarly,  to  approximate  a  large  discrete 
MDP,  we  can  divide  the  states  into  bins  and  assign  the  same  value  to  every  state 
in  a  given  bin.  For  either  a  continuous  or  a  discrete  MDP  we  can  then  pick  one 
sample  state  from  each  bin  and  run  value  iteration  as  if  our  samples  were  the 
entire  state  space.  This  algorithm  is  a  special  case  of  fitted  value  iteration,  and 
so  has  convergence  and  error  guarantees  (see  Chapter  2).  State  aggregation  has 
been in use at least since the 1950s [Bel61, p86]. It is still in use today, often
in  combination  with  adaptive  methods  for  determining  how  finely  to  discretize 
the  state  space  [CT89,  Moo94]. 

If we choose to divide each axis of a $d$-dimensional continuous state space into
$k$ partitions, we will wind up with $k^d$ states in our discretization. Unfortunately,
even if we choose a smallish value for $k$ we can wind up with a huge number of
states: for example, if we choose $k = 100$, a six-dimensional continuous MDP
will translate into a $10^{12}$-state discrete MDP. This problem is called the curse
of dimensionality, since the number of states in the discretization is exponential
in $d$.


5.3.2  Interpolated  value  iteration 

Another  important  special  case  of  fitted  value  iteration,  dating  back  at  least  to 
Bellman's work in the 1950s [Bel61, p86], is the class of interpolating methods.
These  methods  store  the  value  function  only  at  a  predetermined  set  of  states; 
when  the  value  of  some  other  point  is  needed  for  a  backup,  it  is  estimated  by 
some  kind  of  interpolation  scheme.  The  most  common  schemes  are  to  store  the 
values  of  states  at  the  vertices  of  a  regular  grid  and  approximate  the  values  of 
other  states  with  either  constant  interpolation  (in  which  the  value  over  an  entire 
grid  cell  is  the  same)  or  multilinear  interpolation.  Higher-order  polynomial 
interpolation  is  also  possible,  but  can  result  in  divergence. 
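
For the two-dimensional case, multilinear interpolation on a regular grid can be sketched as follows. The function names and the grid layout (`values[i][j]` stored at the vertex `(xs[i], ys[j])`) are assumptions for this sketch, and the query point is assumed to lie inside the grid.

```python
import bisect


def bilinear_weights(x, y, xs, ys):
    """Weights on the four grid vertices surrounding the point (x, y).

    xs and ys are sorted vertex coordinates. For a point inside the grid the
    weights are nonnegative and sum to one.
    """
    i = max(0, min(len(xs) - 2, bisect.bisect_right(xs, x) - 1))
    j = max(0, min(len(ys) - 2, bisect.bisect_right(ys, y) - 1))
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    return {
        (i, j): (1 - tx) * (1 - ty),
        (i + 1, j): tx * (1 - ty),
        (i, j + 1): (1 - tx) * ty,
        (i + 1, j + 1): tx * ty,
    }


def interpolate(values, x, y, xs, ys):
    """Estimate the value at (x, y) from the values stored at grid vertices."""
    weights = bilinear_weights(x, y, xs, ys)
    return sum(w * values[i][j] for (i, j), w in weights.items())


# Hypothetical grid whose stored values come from the plane 2x + 3y.
XS = [0.0, 1.0, 2.0]
YS = [0.0, 1.0]
GRID = [[2 * gx + 3 * gy for gy in YS] for gx in XS]
ESTIMATE = interpolate(GRID, 0.5, 0.25, XS, YS)
```

Because the estimate is a weighted average of at most four stored values, each interpolated state depends on at most four grid vertices.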

For  a  long  time,  grid-based  methods  with  constant  interpolation  were  the 
only  approximate  variant  of  value  iteration  that  was  known  to  converge.  The 
papers  [Gor95a,  TV94]  were,  as  far  as  we  know,  the  first  to  extend  the  proofs 
to  cover  even  multilinear  interpolation,  an  important  extension  since  better 
interpolation  methods  allow  us  to  use  coarser  grids  and  so  solve  larger  problems. 
(Davies  gives  a  two-dimensional  example  where  piecewise  constant  interpolation 
needs about $301^2 = 90601$ cells to achieve the same level of performance as
bilinear interpolation with $11^2 = 121$ cells [Dav96].)

5.3.3  Linear  programming 

The most straightforward way to introduce approximation into the linear
program representation of the Bellman equations is simply to substitute in an
approximate representation for the value function. This approach can work well,
particularly  if  we  can  represent  a  low-error  approximation  of  the  value  function. 
For  examples  of  MDPs  that  we  can  solve  this  way  see  [TZ93,  TZ95,  TZ97]. 

This approach has one important disadvantage. Because we cannot
represent the true value function exactly, we will not be able to satisfy the Bellman
equations exactly. So, we will have to settle for some errors, that is, states whose
assigned values are not equal to the backed up values from their neighbors. But,
because linear programs do not allow their constraints to be violated, all of the
errors in the linear-programming version of the Bellman equations will have the
same sign. To put it another way, the best approximation to v* will trade
infeasibility against suboptimality, while the definition of linear programming treats
feasibility and optimality asymmetrically.

Chapter  4  discusses  in  more  detail  the  problem  of  finding  an  approximate 
value  function  by  linear  programming. 

5.3.4  Least  squares 

For  a  Markov  process,  the  Bellman  equations  reduce  to 

$$Ev + c = 0$$

where $E$ and $c$ are the edge adjacency matrix and cost vector for our process.
$E$ is equal to $P - I$ where $P$ is the transition probability matrix for our process.
See  Chapter  1  for  more  detail. 

We can replace $v$ in the Bellman equations by an approximate representation,
say $v = Aw$. Here $A$ is a matrix whose columns are basis vectors for representing
$v$, and $w$ is a vector of adjustable parameters. If there are $n$ states in our Markov
process and we use $k$ basis vectors to represent $v$, then $A$ will be $n \times k$. With
this substitution, the Bellman equations become $EAw + c = 0$. This is a system
of $n$ equations in $k$ variables. Since in general $k < n$ (that is, since in general
we use fewer basis vectors to represent $v$ than there are states in our Markov
process), these equations are overdetermined; so, they usually do not have a
solution.

There are several ways to find a reasonable coefficient vector $w$ in this
situation. The simplest is to pick $k$ of the $n$ equations and throw away the rest.
The next simplest is to choose $w$ as the least-squares solution, that is,

$$(EA)^T EAw + (EA)^T c = 0$$

The vector $EAw + c$ is called the Bellman error or residual, so the least-squares
solution is the one that minimizes the sum of squared Bellman errors. Finally and
most generally we might pick an arbitrary $n \times k$ matrix $B$ and set

$$B^T EAw + B^T c = 0 \qquad (5.2)$$

If we pick $B = EA$, this method reduces to least squares; or, we can define
a $B$ that keeps $k$ of the $n$ equations and throws away the rest by making the
columns of $B$ be $k$ of the $n$ unit vectors in $\mathbb{R}^n$.

One other choice for $B$ that seems to work well is $B = DA$ for some diagonal
matrix of nonnegative weights $D$. In particular we can set the diagonal elements
of $D$ to be the state visitation frequencies $f$ given by

$$E^T f + s = 0$$

where $s$ is the vector of frequencies of starting in each state. The resulting
equations are

$$A^T \mathrm{Diag}(f)\,[EAw + c] = 0 \qquad (5.3)$$

This  choice  of  B  was  popularized  by  the  TD(0)  algorithm  described  below  in 
Section  5.4.1;  as  is  explained  in  more  detail  there,  TD(0)  uses  this  choice  for 
B  because  it  is  possible  to  compute  an  unbiased  estimate  of  the  coefficients  of 
Equation  5.3  by  sampling  trajectories  from  the  Markov  process. 

TD(0)  never  represents  Equation  5.3  explicitly,  but  instead  solves  it  by 
stochastic gradient descent. The algorithm which represents and solves
Equation 5.3 explicitly is called LS-TD, for Least-Squares TD, even though it is not
actually  a  least  squares  algorithm.  It  is  described  in  [BB96]. 

Methods  based  on  solving  Equation  5.2  have  an  important  advantage  over 
fitted  value  iteration.  As  mentioned  in  Chapter  2,  fitted  value  iteration  applies  a 
function approximator over and over again to the same value function, possibly
resulting in loss of accuracy. Rather than approximating the value function
directly,  Equation  5.2  approximates  the  update  direction  instead.  That  is,  while 
fitted  value  iteration  computes  a  target  value  function  Tv  and  approximates 
that,  Equation  5.2  computes  the  direction  from  the  current  value  function  to 
the  target  value  function,  (T  —  I)v,  and  approximates  that  instead. 

To  see  why  this  difference  is  important,  consider  the  case  where  we  are  lucky 
enough  that  our  function  approximator  can  represent  the  optimal  value  function 
v* perfectly. (The results will be similar if we can only represent something close
to v*.) We pointed out in Chapter 2 that fitted value iteration can still drift
away  from  v*  if  we  are  using  the  wrong  kind  of  function  approximator.  On  the 
other  hand,  the  update  direction  from  v*  is  by  definition  the  zero  vector,  and 
any  linear  function  approximator  can  fit  the  zero  function  exactly.  So,  v*  will 
be  a  solution  to  Equation  5.2. 
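
The argument above can be checked numerically on a small example. The sketch
below solves the weighted equations (Equation 5.3) directly for a made-up
three-state absorbing Markov process. It assumes E = P − I, takes P, c, s, and
A as known rather than estimated from sampled trajectories, and uses a
hypothetical function name; it is an illustration under those assumptions, not
the LS-TD algorithm of [BB96].

```python
import numpy as np

def solve_weighted_td(P, c, s, A):
    """Solve A^T Diag(f) [(P - I) A w + c] = 0 for w.

    P: transition matrix among nonterminal states (rows may sum to less
       than 1; the remainder is the probability of terminating).
    c: expected one-step cost in each state.
    s: frequency of starting in each state.
    A: matrix whose columns are the basis vectors.
    Assumes E = P - I, which is an assumption about the notation.
    """
    n = P.shape[0]
    E = P - np.eye(n)
    # Visitation frequencies satisfy E^T f + s = 0.
    f = np.linalg.solve(E.T, -s)
    B = np.diag(f) @ A                     # the choice B = D A
    w = np.linalg.solve(B.T @ E @ A, -B.T @ c)
    return w, f

# Illustrative chain: 0 -> 1 -> 2 -> terminal, each with a self-loop.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5]])
c = np.array([1.0, 1.0, 1.0])
s = np.array([1.0, 0.0, 0.0])

v_exact = np.linalg.solve(np.eye(3) - P, c)         # exact value function
w_full, f = solve_weighted_td(P, c, s, np.eye(3))   # full (tabular) basis
A_lin = np.array([[3.0], [2.0], [1.0]])             # a single basis vector
w_lin, _ = solve_weighted_td(P, c, s, A_lin)
```

Here the exact values are (6, 4, 2), which the single basis vector (3, 2, 1)
happens to represent exactly, so the solve returns the weight 2 and recovers
the exact value function, as the argument about the zero update direction
predicts.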

Unfortunately,  it  is  difficult  to  generalize  Equation  5.2  to  find  approximate 
solutions  to  Markov  decision  processes:  since  the  Bellman  equations  for  MDPs 
are  nonlinear,  it  is  not  even  clear  how  to  decide  what  rank  B  to  use  to  ensure 
that  there  exists  a  solution. 

5.3.5 Collocation and Galerkin methods

Sometimes we can solve the Hamilton-Jacobi-Bellman equations approximately
by numerical methods. This section describes two related techniques for doing
so. These techniques work best when the HJB equations are linear, that is,
for  Markov  processes  instead  of  MDPs.  In  fact,  they  are  in  some  sense  the 
continuous  time  analogs  of  the  methods  in  Section  5.3.4. 

Suppose we wish to solve a system of differential equations numerically, say
for example

$$f'(t) + f(t) = 0$$
$$f(0) = 1$$

We begin by assuming a simple form for f(t), say $f(t) = a + bt + ct^2$, and
imposing the boundary constraint f(0) = 1 to find a = 1. Now we can analytically
evaluate the derivative to get

$$(b + 2ct) + (1 + bt + ct^2) = 0 \qquad (5.4)$$

For any given value of t, this is an ordinary algebraic equation. In fact, since
both the original differential equation and our approximation to f are linear,
the algebraic equation is linear in b and c for each t. In general it will be
impossible to satisfy the equation for all t, since we have replaced an arbitrary
smooth function f by an approximation with only a finite number of degrees of
freedom. So, we will need to pick a reduced set of equations to satisfy.

There are several ways to pick a reduced set of equations. The simplest is
collocation [GO77], in which we choose just enough values of t from the interval
of interest to guarantee a unique solution. In our example we have two free
parameters; so, since each collocation point gives us one new equation, we need
to collocate at two points. If we choose t = 0, 1, we can solve for the coefficients
b = −1, c = 1/3. Figure 5.2 compares the resulting approximation to the true
solution $e^{-t}$ near the collocation points.

Figure 5.2: The solution to f' + f = 0 along with two approximations.

The  choice  of  collocation  points  can  influence  the  quality  of  our  result.  We 
can  reduce  the  dependence  of  the  answer  on  our  exact  choice  of  points,  and 
so  sometimes  get  a  more  accurate  approximation,  by  choosing  more  collocation 
points  than  are  strictly  necessary  and  solving  the  resulting  overdetermined  set 
of  equations  by  least  squares. 
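
As a rough illustration of both the exactly determined and the overdetermined
case, the sketch below collocates the quadratic $f(t) = 1 + bt + ct^2$ against
f' + f = 0. The function name and the choice of five equally spaced points are
assumptions made for the example only.

```python
import numpy as np

def collocate(ts):
    """Fit f(t) = 1 + b t + c t^2 to f' + f = 0 at the points ts.

    Each point t contributes one linear equation in (b, c),
        (1 + t) b + (2 t + t^2) c = -1,
    obtained by collecting terms in (b + 2ct) + (1 + bt + ct^2) = 0.
    With more than two points the system is overdetermined and is
    solved in the least-squares sense.
    """
    ts = np.asarray(ts, dtype=float)
    M = np.column_stack([1.0 + ts, 2.0 * ts + ts**2])
    rhs = -np.ones_like(ts)
    coef, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return coef                                    # array [b, c]

b2, c2 = collocate([0.0, 1.0])                     # two points: exact solve
b5, c5 = collocate(np.linspace(0.0, 1.0, 5))       # five points: least squares
```

With the two points t = 0, 1 the system is square and the routine returns
b = −1, c = 1/3. With five points the coefficients are a compromise among all
five equations rather than an exact fit to any two of them.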

Rather than a set of test points, the so-called Galerkin methods [GO77] use
a set of test functions instead. Each test function specifies a weighted average
of the equations for different values of t. For example, if we choose the test
functions $t$ and $t^2$ over the interval [0, 1], Equation 5.4 yields the constraints

$$\int_0^1 t\,\bigl(1 + b + (b + 2c)t + ct^2\bigr)\,dt = 0$$
$$\int_0^1 t^2\,\bigl(1 + b + (b + 2c)t + ct^2\bigr)\,dt = 0$$

which we can solve to find $b = -32/35$, $c = 2/7$ (see Figure 5.2).
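
The two Galerkin constraints reduce to a 2x2 linear system once the integrals
of the monomials are evaluated. The following sketch carries out that
arithmetic with exact rationals; the helper names are hypothetical and the
code is specific to this one example.

```python
from fractions import Fraction

def moment(k):
    """Integral of t^k over [0, 1]."""
    return Fraction(1, k + 1)

def galerkin_row(p):
    """Coefficients (of b, of c, constant term) of the constraint
    integral_0^1 t^p * (1 + b + (b + 2c) t + c t^2) dt = 0."""
    coef_b = moment(p) + moment(p + 1)
    coef_c = 2 * moment(p + 1) + moment(p + 2)
    const = moment(p)
    return coef_b, coef_c, const

# Test functions t and t^2 give two linear equations in b and c.
(b1, c1, k1), (b2, c2, k2) = galerkin_row(1), galerkin_row(2)

# Solve the 2x2 system by Cramer's rule, keeping exact rationals.
det = b1 * c2 - b2 * c1
b = (-k1 * c2 + k2 * c1) / det
c = (-b1 * k2 + b2 * k1) / det
print(b, c)   # -32/35 2/7
```

Cleared of fractions, the two constraints are 10b + 11c + 6 = 0 and
35b + 42c + 20 = 0, whose solution is b = −32/35, c = 2/7.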

Galerkin  methods  are  more  general  than  collocation,  since  we  can  choose 
Dirac δ-functions as our test functions in a Galerkin method and reduce it
to  collocation.  Just  as  in  collocation,  the  choice  of  test  functions  influences 
the  quality  of  the  resulting  approximation;  in  our  example,  we  have  followed 
common  practice  and  chosen  the  basis  functions  themselves  as  test  functions 
(recall  that  we  fixed  the  coefficient  of  1  to  satisfy  the  boundary  condition,  thus 
removing  it  from  the  basis). 

We can use collocation or Galerkin methods to find the value function of a
continuous-time deterministic Markov chain. If we assume that the goal state
is at the origin, and if the state vector evolves according to

$$\frac{dx}{dt} = f(x)$$

then the value function satisfies the HJB equations

$$\frac{dv(x)}{dx} \cdot f(x) + c(x) = 0$$
$$v(0) = 0$$

Now suppose that we choose a set of basis functions $\beta_i(x)$, each with $\beta_i(0) = 0$,
and perform a Galerkin approximation using the basis functions as test
functions. A typical constraint will look like

$$\int_S \beta_j(x) \left[ \sum_i w_i \frac{d\beta_i(x)}{dx} \cdot f(x) + c(x) \right] dx = 0 \qquad (5.5)$$

where S is the state space and $w_i$ is the weight for $\beta_i$. While this expression looks
formidable, it is actually completely analogous to the unweighted TD equations

$$A^{\top} [(P - I) A w + c] = 0 \qquad (5.6)$$

(Equation 5.6 is the same as Equation 5.2 with the choice B = A.) Replacing
v by Aw is analogous to replacing v(x) by $\sum_i w_i \beta_i(x)$, with the ith column
of A playing the same role as $\beta_i$. The term $\sum_i w_i \frac{d\beta_i(x)}{dx} \cdot f(x)$ is the rate
at which the value of the current state changes with time, given that we are in
state x; it is analogous to a single component of the vector (P − I)Aw in the TD
equations. So, the term in square brackets in Equation 5.5 is analogous to the
term in square brackets in Equation 5.6. Finally, the integral is the continuous
equivalent of a dot product, so using the $\beta_i$ as test functions in Equation 5.5 is
analogous to the multiplication by $A^{\top}$ in Equation 5.6. The end result in either
case is the same: we are computing for each state the rate of change of the value
function with time, and constraining the resulting vector to be perpendicular
to each of our basis functions.

Unfortunately, just as in Section 5.3.4, it is not clear how to generalize
collocation and Galerkin methods from Markov chains to Markov decision
processes. Since the HJB equation is in general nonlinear, collocation or Galerkin
methods will yield a set of nonlinear algebraic equations. It can be arbitrarily
difficult to solve these equations; in fact it is not even clear how many
collocation points or test functions are necessary to ensure that they have a unique
solution.


5.3.6  Squared  Bellman  error 

In  Section  5.3.4  we  discussed  substituting  an  approximate  representation  for  the 
value function into the Bellman equations. In that section, we used a
representation which was linear in its parameters and we restricted attention to Markov
processes;  the  result  was  that  we  derived  a  system  of  linear  equations  for  the 
coefficients  in  our  approximation. 

In this section we will examine the more general case where we allow
nonlinear function approximators such as neural networks, and where we replace
Markov processes by Markov decision processes. In this case, of course, we
will not be able to find a closed-form solution for the parameters of our
approximation to the value function. Instead, we will need to rely on numerical
methods.

In particular, we will focus on numerical methods for finding a local
minimum of the sum of squared Bellman errors. The Bellman error vector for an
approximate value function v is defined to be Tv − v, where T is the parallel
value backup operator for our Markov decision process. So, the sum of squared
Bellman errors is a nonnegative real-valued function of the parameters of our
approximation to the value function.
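
To make the definition concrete, the sketch below evaluates the sum of squared
Bellman errors for a tabular value function. The two-state, two-action MDP,
the discount factor, and the array layout (P[a] is the transition matrix for
action a, c[x, a] the cost of action a in state x) are assumptions chosen for
the example.

```python
import numpy as np

def backup(v, P, c, gamma):
    """Parallel value backup: (T v)(x) = min_a [c(x,a) + gamma * sum_y P[a,x,y] v(y)]."""
    q = c + gamma * np.einsum("axy,y->xa", P, v)    # q[x, a]
    return q.min(axis=1)

def squared_bellman_error(v, P, c, gamma):
    """Sum over states of the squared components of T v - v."""
    r = backup(v, P, c, gamma) - v
    return float(r @ r)

# Hypothetical MDP: action 0 stays put, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
c = np.array([[1.0, 0.5],
              [0.0, 2.0]])
gamma = 0.9

v = np.zeros(2)
err_at_zero = squared_bellman_error(v, P, c, gamma)
for _ in range(100):                                # plain value iteration
    v = backup(v, P, c, gamma)
err_at_fixed_point = squared_bellman_error(v, P, c, gamma)
```

In this example the error is 0.25 at v = 0 and vanishes at the fixed point
v = (0.5, 0) of the backup operator.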

Unfortunately, squared Bellman error is a badly-behaved function: it is
poorly conditioned and it has derivative discontinuities. Ill-conditioning
happens because the values of two states can be strongly linked even if they are
separated by many time steps. (Two states will be linked when the current
policy causes the agent to move from one to the other with high probability.) If
we update the values of such a pair of states in opposite directions, the Bellman
error will change much more quickly than if we update them in the same
direction. This poor conditioning means that the contours of equal Bellman error are
long and narrow, so that simple minimization algorithms like gradient descent
will be forced to take short steps and converge slowly.

On the other hand, methods which are more robust to ill-conditioning, such
as conjugate gradient and Newton’s method, often depend on the smoothness
of the function to be minimized. Unfortunately, the Bellman error function can
have discontinuous derivatives even for linearly-parameterized families of value
functions: there will usually be a derivative discontinuity at every value
function for which there is more than one greedy policy. So, for example, conjugate
gradient can get caught against a derivative discontinuity in such a way that
none of its line searches ever makes progress, while Newton’s method can
oscillate forever by stepping back and forth across a discontinuity. (Interestingly,
Newton’s method for minimizing $\|Tv - v\|_2^2$ with respect to v is identical to
policy iteration, so it is guaranteed to converge; Newton’s method can only have
problems when we substitute an approximation for v.)
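
One way to see the connection to policy iteration: wherever the greedy policy
π at v is unique, Tv equals c_π + γ P_π v in a neighborhood of v, so Tv − v is
locally affine with Jacobian γ P_π − I, and a Newton step lands on the solution
of (I − γ P_π) v' = c_π, which is the value function of π. The sketch below
performs that step for a tabular v; the function name, the example MDP, and
the array layout are assumptions, and ties between greedy actions are not
handled.

```python
import numpy as np

def newton_step(v, P, c, gamma):
    """One Newton step for T v - v = 0, assuming a unique greedy policy at v.

    P[a] is the transition matrix for action a; c[x, a] is the cost of
    taking action a in state x.
    """
    n = len(v)
    q = c + gamma * np.einsum("axy,y->xa", P, v)
    pi = q.argmin(axis=1)                   # greedy policy at v
    P_pi = P[pi, np.arange(n), :]           # row x is row x of P[pi[x]]
    c_pi = c[np.arange(n), pi]
    # Near v, T v = c_pi + gamma P_pi v, so the Newton step solves
    # (I - gamma P_pi) v' = c_pi: policy evaluation for pi.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)

# Hypothetical MDP: action 0 stays put, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
c = np.array([[1.0, 0.5],
              [0.0, 2.0]])
gamma = 0.9

v1 = newton_step(np.zeros(2), P, c, gamma)
v2 = newton_step(v1, P, c, gamma)           # greedy policy no longer changes
```

Each call combines policy improvement (the argmin) with exact policy
evaluation (the linear solve), which is one sweep of policy iteration.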

Figure 5.3 shows several views of the Bellman error surface for a very simple
MDP. On the bottom row of the figure is the MDP. It has two states, so its value
function is an element of $\mathbb{R}^2$: the two coordinates are x and y, the estimated
values for the left-hand and right-hand states respectively. The top row of
the figure shows a 3D and a contour plot of the MDP’s error surface: the x
and y axes represent our current estimate of the value function, while the z
axis shows the sum of squared Bellman errors for each estimate. These plots
clearly show the derivative discontinuity that happens when the two actions
from the right-hand state have the same backed-up value. They also show that
the contours of the error surface near the global minimum can be elliptical.
In this plot the ellipses are close to circular and therefore well-conditioned,
but changing the transition probabilities can give the contours arbitrarily bad
aspect ratios. Finally, the middle row of the plot shows the error surfaces for two
different one-dimensional slices of the set of possible value functions. These
one-dimensional slices correspond to different one-parameter families of approximate
representations for the value function. As the plots show, it is easily possible to
have multiple local minima or derivative discontinuities at the minimum.

It may be possible to minimize Bellman error efficiently by using hybrid
algorithms, for example damped Newton methods, Levenberg-Marquardt, or
gradient descent with momentum. Baird has proposed a promising hybrid method
which interpolates between temporal differencing (described below) and
gradient descent [Bai95].

Even  if  we  can  find  the  parameters  which  minimize  squared  Bellman  error, 
though,  there  is  another  important  difficulty:  not  all  Bellman  errors  are  equally 
important.  In  some  MDPs,  many  optimal  paths  pass  through  one  or  a  small 
number  of  bottleneck  states.  Errors  at  the  bottlenecks  are  more  important 
than  errors  elsewhere:  at  the  bottlenecks,  a  single  error  can  affect  many  paths. 
If  we  simply  minimize  Bellman  error,  we  may  end  up  accepting  an  important 
error  at  a  bottleneck  instead  of  a  larger  but  less  important  error  at  some  other 
state.  Worse,  we  can’t  sidestep  the  problem  simply  by  weighting  errors  at 
the  bottleneck  states  more  heavily,  since  different  policies  can  have  different 
bottlenecks  and  we  won’t  know  which  states  are  the  real  bottlenecks  until  we 
have  already  found  the  optimal  policy. 

There  are  heuristic  algorithms  which  attempt  to  reweight  states  during  the 
optimization procedure, but so far no such algorithm has been proven to
converge for general function approximators. These algorithms can perform quite
well  in  practice. 

5.3.7 Multi-step methods

The Bellman constraint that corresponds to a transition from state x to state
y with cost c is

$$v(x) \le \gamma v(y) + c$$

This constraint relates the value of state x to the value of its immediate successor
y. Similarly, the constraint that corresponds to a transition from y to z with
cost d is $v(y) \le \gamma v(z) + d$. Combining these two constraints gives

$$v(x) \le \gamma^2 v(z) + \gamma d + c \qquad (5.7)$$

Equation 5.7 relates the value of state x to the value of its two-step successor z.

We can combine three successive one-step constraints to make a three-step
constraint, four to make a four-step constraint, and so forth. In an absorbing
MDP, we can go so far as to combine all the transitions in an entire trajectory
to make a single constraint of the form $v(x) \le \text{constant}$. Such an inequality
is called a TD(1) constraint, by analogy to the TD(λ) algorithm described in
Section 5.4.1. The advantage of a TD(1) constraint is that it is not recursive: it
constrains the value of only one state rather than two. That means that we can
use supervised learning algorithms to find approximate solutions to problems
that contain only TD(1) constraints.

There may be many more multi-step constraints than there are one-step ones:
if our MDP has a constant number of actions from each state, then (ignoring
possible duplicates) the number of k-step constraints on v(x) is exponential in
k. (A degenerate case of this rule applies to Markov processes. For a Markov
process the base of the exponential is 1, meaning that there is exactly one k-step
constraint on v(x) for each positive k.) To avoid dealing with an exponential
number of constraints, many practical methods restrict their attention to
multi-step constraints for transition sequences that actually occur in the observed
data. For example, such methods would ignore the constraint (5.7) unless the
learner had at some point moved from state x to state y to state z.
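
For a single observed trajectory, the constants in the k-step constraints are
just discounted partial sums of the observed costs. The sketch below computes
them; the function name is hypothetical, and treating the last entry as a
TD(1) bound assumes the trajectory ends in an absorbing state of value zero.

```python
def multistep_bounds(costs, gamma):
    """Constants in the k-step constraints along one observed trajectory.

    costs[t] is the cost of the t-th transition. Entry k-1 of the result
    is the constant C_k in  v(x_0) <= gamma**k * v(x_k) + C_k.
    If the trajectory ends in an absorbing state of value zero (an
    assumption of this sketch), the last entry is the TD(1) bound
    v(x_0) <= constant.
    """
    bounds, total, discount = [], 0.0, 1.0
    for cost in costs:
        total += discount * cost
        discount *= gamma
        bounds.append(total)
    return bounds

# Costs c = 1, d = 2, then 4 on the final transition; gamma = 0.5.
bounds = multistep_bounds([1.0, 2.0, 4.0], 0.5)
```

The second entry reproduces the constant γd + c of the two-step constraint
built from a cost-c transition followed by a cost-d transition.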

Multi-step constraints are redundant if we plan to solve the Bellman
equations exactly. But approximate methods for solving the Bellman equations
may treat a multi-step constraint differently from its component one-step
constraints. For example, for a Markov process we can define a k-step version of
Equation 5.3 that looks like

Aw  =  ... 

It  is  reasonable  to  ask  whether  approximate  methods  are  likely  to  be  more 
accurate  if  they  use  one-step  or  multi-step  constraints.  As  a  rough  rule,  one- 
step  constraints  are  more  data-efficient,  while  multi-step  constraints  are  better 
at  minimizing  the  effects  of  the  function  approximator.  There  is  experimental 
evidence  [Sut88]  which  suggests  that  a  combination  of  constraints  at  different 
time  scales  works  better  than  either  single-step  constraints  or  TD(1)  constraints 
alone. 


5.3.8  Stopping  problems 

Stopping  problems  are  the  subset  of  MDPs  in  which  the  agent  has  exactly  two 
actions  at  each  state:  one  action  is  called  “continue”  and  has  an  arbitrary  effect, 
and  the  other  is  called  “stop”  and  leads  immediately  to  the  ending  state  ©. 
The  paper  [TV97]  points  out  that,  unlike  for  general  MDPs,  there  are  still  well- 
defined  state  visitation  frequencies  in  a  stopping  problem:  these  are  just  the 
frequencies  with  which  we  would  visit  the  nonterminal  states  if  we  never  chose 
the  stop  action.  So,  it  makes  sense  to  solve  the  nonlinear  equations 

$$A^{\top} D \min(PAw + c,\, d) = A^{\top} D A w$$

where P is the transition probability matrix for continuing, c is the cost vector
for continuing, d is the cost vector for stopping, D is the diagonal matrix whose
entries are the state visitation frequencies, A is a matrix whose columns are
basis vectors for representing the value function, and the minimum operation is
taken componentwise. This expression is the analog of Equation 5.3. While the
minimum operation makes the equations nonlinear, [TV97] gives a convergent
algorithm for finding the solution.

5.3.9 Approximate policy iteration

It  is  possible  to  combine  policy  iteration  with  approximate  methods  for  finding 
value  functions.  There  are  no  such  combinations  that  have  been  proven  to 
converge  for  general  MDPs  and  function  approximators,  but  some  combinations 
seem  to  work  in  practice.  For  example,  the  experiments  in  Chapter  4  use  one 
such  algorithm,  and  another  is  described  in  [BI96]. 

5.3.10  Policies  without  values 

It  is  possible  to  learn  a  policy  directly,  without  representing  value  functions  along 
the  way.  For  example,  we  can  pick  a  starting  policy,  evaluate  it  by  following 
some  trajectories  and  measuring  the  incurred  cost,  and  try  to  modify  it  to  make 
it  better.  Methods  for  doing  so  include  gradient  descent,  simulated  annealing, 
and  genetic  algorithms. 

Unlike  simulated  annealing  and  genetic  algorithms,  gradient  descent  requires 
the  ability  to  compute  an  unbiased  estimate  of  the  gradient  of  a  parameterized 
policy’s expected cost with respect to one of its parameters. It is not
obvious that it is possible to compute this gradient without reference to the value
function,  but  [Wil92]  gives  an  algorithm  called  Reinforce  which  does  so. 
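
The idea behind such estimators can be illustrated on a one-step problem. The
sketch below uses a softmax policy over three actions with fixed, made-up
costs, and checks by enumeration that the score-function estimate
(cost times the gradient of the log-probability of the chosen action) has the
gradient of expected cost as its mean. It is a simplified illustration in the
spirit of Reinforce, not a transcription of the algorithm in [Wil92].

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def score_function_estimate(theta, action, cost):
    """Single-sample gradient estimate: cost * grad_theta log pi(action)."""
    pi = softmax(theta)
    grad_log = -pi                     # new array; pi itself is not modified
    grad_log[action] += 1.0            # gradient of log softmax at `action`
    return cost * grad_log

def exact_gradient(theta, costs):
    """Gradient of the expected cost sum_a pi(a) * costs[a]."""
    pi = softmax(theta)
    return pi * (costs - pi @ costs)

theta = np.array([0.2, -0.1, 0.4])
costs = np.array([1.0, 3.0, 2.0])      # hypothetical per-action costs
pi = softmax(theta)
# Expectation of the estimator, computed by enumerating the actions.
expected = sum(pi[a] * score_function_estimate(theta, a, costs[a])
               for a in range(len(theta)))
```

No value function appears anywhere in the estimate: only the sampled action,
its observed cost, and the policy's own parameters are needed.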

The  advantage  of  methods  of  this  kind  is  that  they  try  directly  to  optimize 
actual  costs,  instead  of  some  proxy  for  actual  costs  like  the  consistency  of  a 
value  function.  The  disadvantage  of  these  methods  is  that  they  can  be  slow 
to  converge:  without  the  intermediate  representation  of  a  value  function,  it  is 
harder  to  decide  which  parts  of  a  policy  are  responsible  for  high  costs. 

Baird  and  Moore  [BM99]  have  recently  derived  an  algorithm  called  VAPS 
(for  value  and  policy  search)  that  can  combine  gradient  descent  on  expected 
total  cost  with  gradient  descent  on  squared  Bellman  error  or  on  other  related 
performance  measures.  Such  an  algorithm  can  use  a  value  function  to  decide 
which  parts  of  a  policy  need  modifying,  but  can  also  take  actual  costs  into 
account  directly. 

5.3.11 Linear-quadratic-Gaussian approximations

It  is  common  practice  to  approximate  a  nonlinear  control  problem  by  an  LQG 
problem  in  some  neighborhood.  Unfortunately,  a  single  linear-quadratic  model 
is often not sufficient, and it is much harder to build a piecewise-LQG
approximation to a control problem. The difficulty is in ensuring consistency along
the edges of the pieces: the value function in each piece no longer satisfies the
Riccati equations, since it depends also on the values in every other piece.

One  approach  to  this  problem  is  to  ignore  it.  That  is,  we  can  compute  several 
separate LQG approximations around different points, ignoring possible
interactions. Then we can control the system using the LQG approximation which
is most appropriate for the current operating conditions, or by interpolating
among several nearby models. This approach is called gain scheduling. It is
particularly effective when the reward function is globally quadratic, as it is for
example when we are trying to track a reference signal as closely as possible. In
this  case  the  LQG  models  can’t  get  confused  about  where  the  lowest  costs  are, 
but  only  about  how  to  get  there.  In  addition,  if  the  controller  does  get  stuck  far 
from  the  small  costs,  it  is  often  possible  to  unstick  it  by  hallucinating  a  series 
of  target  points  (represented  as  a  series  of  fictitious  quadratic  cost  functions) 
which  are  close  enough  together  that  the  linear-quadratic  approximations  can 
follow  them  and  which  lead  the  controller  to  a  desirable  region  of  state  space. 
Of  course,  the  question  of  which  target  points  to  use  can  be  as  difficult  as  the 
original  control  problem. 

For  control  problems  too  difficult  for  gain  scheduling,  Atkeson  has  developed 
a  method  for  growing  “spines”  backward  along  optimal  trajectories  [Atk94].  A 
spine comprises a series of local LQG models; each model is locally
approximately consistent with the previous and subsequent models on the same spine,
but models on different spines do not interact, so there are not too many
dependencies between models.

Control  methods  based  on  linearization  suffer  from  some  problems.  The 
first  is  that  they  may  require  a  large  number  of  linear  pieces,  forcing  us  either 
to  store  many  precomputed  controllers  or  to  search  for  and  generate  controllers 
as  needed  in  real  time.  The  second  and  more  important  is  that  the  system  may 
not  be  even  locally  approximable  by  an  LQG  model:  transition  functions  aren’t 
always  smooth,  errors  aren’t  always  small  and  Gaussian,  and  arbitrarily  large 
control  inputs  aren’t  always  practical. 


5.4  Incremental  algorithms 

Two of the best-known algorithms for finding value functions are TD(λ) and
Q-learning [Sut88, Wat89]. Both of these algorithms are incremental, meaning
that they examine each training example once and then forget it. This property
may be useful if storage space is at a premium or if it is as easy to generate a
new training example as it is to remember an old one. Q-learning solves Markov
decision processes but does not handle function approximation, while TD(λ) can
handle function approximation but only solves Markov processes.

5.4.1 TD(λ)

TD(λ) is an algorithm that finds approximate value functions for Markov
processes. (TD is short for temporal differences, because the update for TD(λ)
depends on the difference between parameters of successive states.) It can use
any representation for value functions that is linear in its coefficients; that is, it
can represent v = Aw for any matrix A whose columns we want to use as basis
vectors.

If  we  are  given  a  Markov  process,  it  is  possible  to  discover  the  bottleneck 
states  by  observing  actual  or  simulated  trajectories  from  the  process.  (This  is 
not  true  for  an  MDP,  since  the  bottleneck  states  depend  on  the  optimal  policy.) 
By observing trajectories, we can build unbiased estimates of how often we
visit each state. Once we know the state visitation frequencies, we can solve
Equation  5.3  to  find  an  approximate  value  function. 

The TD(0) algorithm is an incremental algorithm which implicitly discovers
the state visitation frequencies and solves Equation 5.3. After observing a
transition from state i to state j at cost c, TD(0) updates its parameter vector
w by the rule

$$w \leftarrow w + \eta\, a_i \bigl( (\gamma a_j - a_i) \cdot w + c \bigr)$$

where η is a learning rate and $a_i$ and $a_j$ are the ith and jth rows of A expressed
as column vectors. It is possible to show [Sut88, Day92, TV96] that under
appropriate conditions TD(0) converges to the solution of Equation 5.3.
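
A minimal sketch of this update rule follows. The function name, the default
of γ = 1, the convention of passing a zero feature vector for a terminal
state, and the deterministic two-state chain used to exercise it are all
assumptions made for the illustration.

```python
import numpy as np

def td0_update(w, a_i, a_j, cost, eta, gamma=1.0):
    """w <- w + eta * a_i * ((gamma * a_j - a_i) . w + cost).

    a_i and a_j are the rows of A for the states before and after the
    transition. For a transition into a terminal state, pass a_j as a
    vector of zeros (an assumption of this sketch).
    """
    td_error = (gamma * a_j - a_i) @ w + cost
    return w + eta * td_error * a_i

# Deterministic chain 0 -> 1 -> terminal with costs 1 and 2, so the exact
# values are v = (3, 2). A = I gives one parameter per state.
A = np.eye(2)
w = np.zeros(2)
terminal = np.zeros(2)
for _ in range(100):                   # repeated passes over one trajectory
    w = td0_update(w, A[0], A[1], cost=1.0, eta=0.5)
    w = td0_update(w, A[1], terminal, cost=2.0, eta=0.5)
```

With a constant learning rate and a deterministic chain the iterates approach
(3, 2) geometrically; with sampled transitions a decreasing learning rate
would normally be needed.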

TD(λ) is a slightly more complicated algorithm with an update that depends
on a whole sequence of states instead of just the last two. As the papers cited
above show, it converges to the solution of an equation similar to Equation 5.3.

There is no straightforward way to generalize TD(λ) to solve Markov decision
processes. Still, there are several popular heuristic MDP algorithms based on
the method of temporal differences. These include TD-based variants of value
iteration, Q-learning, policy iteration, and modified policy iteration. Perhaps
the most successful is TD value iteration, which has surfaced for example in a
world-class backgammon player [Tes94] and an elevator controller [CB96].

TD-based  methods  have  the  advantage  that,  at  least  heuristically,  one  would 
expect  them  to  be  good  at  finding  bottleneck  states  because  they  always  reweight 
each  state  based  on  how  often  the  agent  encounters  it  while  following  the  current 
policy.  Unfortunately,  this  advantage  is  only  heuristic:  no  one  has  yet  found 
a  characterization  of  when  these  methods  even  converge,  much  less  a  proof 
that  they  end  up  with  reasonable  weights.  In  fact,  it  is  possible  to  construct 
examples  [Gor96,  Ber96]  where  some  of  these  methods  oscillate  forever  between 
two  or  more  policies  with  different  value  functions. 

TD-based  methods  depend  on  being  able  to  find  out  the  state-visitation 
frequencies  for  each  policy.  (In  fact,  it  is  easy  to  cause  them  to  diverge  by 
visiting  states  at  the  wrong  frequencies.)  This  fact  is  both  an  advantage  and 
a  disadvantage:  while  it  allows  TD-based  algorithms  to  take  bottleneck  states 
into  account  easily  and  naturally,  it  means  that  all  known  implementations  are 
based  on  following  trajectories  in  either  the  real  MDP  or  a  model  of  it,  which 
can  be  an  efficiency  disadvantage  compared  to  non-TD-based  algorithms. 


5.4.2  Q-learning 

It  is  difficult  to  write  an  incremental  algorithm  which  directly  learns  the  value 
function  of  a  Markov  decision  process.  The  problem  is  the  location  of  the 
nonlinearity  in  the  Bellman  equations:  if  we  write 

$$v^{(i+1)}(x) = \min_a E\left[ c(x,a) + \gamma v^{(i)}(\delta(x,a)) \right]$$

then it is easy to get an unbiased estimate of the expectation for a single value
of a, but it is hard to get an unbiased estimate of the minimum over all a.
To see why, imagine taking the minimum of two numbers, each corrupted by
zero-mean random noise. The minimum will be below the true minimum if the
noise in either number is negative, while it will be above the true minimum only
if both numbers have positive noise [TS93].
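
The bias is easy to exhibit exactly in a small case. The sketch below
enumerates all noise outcomes for two equal numbers, each perturbed
independently by +1 or −1 with equal probability; the function name and the
particular noise distribution are assumptions chosen for the example.

```python
from itertools import product

def expected_noisy_min(values, noise_levels):
    """Exact expectation of min_i(values[i] + noise_i), where each noise_i
    is drawn independently and uniformly from noise_levels."""
    outcomes = list(product(noise_levels, repeat=len(values)))
    total = sum(min(v + e for v, e in zip(values, noise))
                for noise in outcomes)
    return total / len(outcomes)

true_min = min([0.0, 0.0])
biased = expected_noisy_min([0.0, 0.0], [-1.0, 1.0])
```

Three of the four equally likely outcomes give a minimum of −1 and only one
gives +1, so the expected noisy minimum is −0.5 even though the true minimum
is 0 and each noise term has mean zero.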

To solve this problem, we can (as suggested in [Wat89]) break the Bellman
equation into two pieces:

$$Q(x,a) = E\left[ c(x,a) + \gamma v(\delta(x,a)) \right]$$
$$v(x) = \min_a Q(x,a)$$

If we write

$$Q^{(i+1)}(x,a) = E\left[ c(x,a) + \gamma \min_b Q^{(i)}(\delta(x,a), b) \right]$$

then it is easy to get an unbiased estimate of the expectation: we can sample
c from the distribution of c(x, a) and y from the distribution of δ(x, a) and
compute $c + \gamma \min_b Q^{(i)}(y, b)$.

The Q-learning algorithm stores Q instead of v. On each step it samples a
transition (say from state x to state y under action a at cost c) and updates

$$Q(x,a) \leftarrow (1 - \eta)\, Q(x,a) + \eta \bigl( c + \gamma \min_b Q(y,b) \bigr)$$

Under appropriate assumptions, [JJS94, Tsi94] prove that Q-learning converges
with probability 1 to the true Q function.
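
A tabular version of this update can be sketched as follows. The function
name, the `terminal` flag, and the deterministic two-state MDP (swept in a
fixed order rather than sampled) are assumptions made so that the example is
reproducible; they are not part of the convergence results cited above.

```python
import numpy as np

def q_update(Q, x, a, cost, y, eta, gamma, terminal=False):
    """Q(x,a) <- (1 - eta) Q(x,a) + eta * (cost + gamma * min_b Q(y,b)).

    Q is a table indexed as Q[state, action]; it is modified in place.
    The `terminal` flag (an addition of this sketch) treats y as an
    absorbing state with value zero.
    """
    target = cost if terminal else cost + gamma * Q[y].min()
    Q[x, a] = (1.0 - eta) * Q[x, a] + eta * target
    return Q

# Hypothetical MDP: action 0 stays put, action 1 switches state.
costs = np.array([[1.0, 0.5],
                  [0.0, 2.0]])
next_state = np.array([[0, 1],
                       [1, 0]])
Q = np.zeros((2, 2))
for _ in range(500):                   # fixed sweeps stand in for sampling
    for x in range(2):
        for a in range(2):
            q_update(Q, x, a, costs[x, a], next_state[x, a],
                     eta=0.5, gamma=0.9)
v = Q.min(axis=1)                      # value function recovered from Q
```

Because the minimum is taken over stored Q values inside the target, each
sampled target is an unbiased estimate of the right-hand side for the current
table, which is the point of storing Q rather than v.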

5.5  Other  methods 

There  is  a  long  history  of  research  into  Markov  decision  processes  and  related 
problems,  and  we  have  only  summarized  a  fraction  of  it  here.  Some  interesting 
approaches  not  mentioned  above  are: 

•  Methods  which  assume  a  particular  form  of  representation  for  the  solution 
to  the  HJB  equation,  including  [DS96]  and  [Goh93]. 

•  Adaptive  control  (see,  e.g.,  [SL91]),  which  attempts  to  control  a  system 
containing  unknown  parameters  by  adapting  parameter  estimates  online. 
The adaptation law may be chosen to try to reproduce the observed
dynamics as accurately as possible (self-tuning control); or, more directly,
it  may  try  to  reduce  the  tracking  error  between  the  observed  trajectory 
and  the  trajectory  predicted  by  an  ideal  reference  model  (model-reference 
adaptive  control).  General  convergence  guarantees  usually  require  the 
model  to  have  some  special  form,  for  example  linear  separately  in  the 
control  inputs  and  the  unknown  parameters.  Adaptive  control  techniques 
may  be  combined  with  the  robust  sliding  mode  control  design  described 
above.  See  [OS95]  for  a  modern  example  of  an  adaptive  control  algorithm. 


•  Various neural-net approaches based on “unfolding” a problem by making
a  copy  of  the  adjustable  parameters  for  each  time  step.  After  unfolding, 
all  variable  dependencies  are  feedforward,  so  derivative  calculations  are 
simplified. 

5.6  Summary 

The  research  in  this  thesis  extends  the  state  of  the  art  in  several  ways.  To 
understand  how,  we  can  define  the  following  hierarchy  of  function  fitters.  Each 
type  of  function  approximation  algorithm  in  the  list  includes  and  generalizes 
the  previous  ones. 

Exact  A  degenerate  case.  Represents  a  function  by  storing  its  value  at  every 
possible  input. 

Piecewise  constant  Includes  grids  and  other  state  aggregation. 

Averager As defined in Chapter 2. Includes k-nearest-neighbor and linear
and multilinear interpolation.

Linear Linear regression with an arbitrary basis, including for example
polynomials, sines and cosines, and wavelets.

Generalized linear A linear function with a monotone transfer function
applied to the output. Includes for example logistic regression.

General Everything else. Examples include neural nets and hierarchical
mixtures of experts.
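
To make the "averager" level of this hierarchy concrete, the following is a minimal sketch, assuming a one-dimensional input space and a plain k-nearest-neighbor fit. The function names and the sample data are hypothetical and chosen only for illustration. The sketch shows the property that matters for convergence: each fitted value is a weighted average of the target values with nonnegative weights summing to one, so the fit cannot enlarge the max-norm distance between two sets of targets.

```python
def knn_averager_weights(sample_xs, query_x, k):
    """Weights that a k-nearest-neighbor fit at query_x places on each sample."""
    # Sort sample indices by distance to the query (ties broken by index).
    order = sorted(range(len(sample_xs)),
                   key=lambda i: (abs(sample_xs[i] - query_x), i))
    weights = [0.0] * len(sample_xs)
    for i in order[:k]:
        weights[i] = 1.0 / k
    return weights


def fit(sample_xs, targets, query_x, k):
    """Fitted value at query_x: a weighted average of the targets."""
    w = knn_averager_weights(sample_xs, query_x, k)
    return sum(wi * ti for wi, ti in zip(w, targets))


# Hypothetical data: two different target vectors on the same sample points.
sample_xs = [0.0, 1.0, 2.0, 3.0]
targets_a = [1.0, 4.0, 2.0, 8.0]
targets_b = [1.5, 3.0, 2.5, 7.0]
queries = [0.2, 1.4, 2.9]
K = 2

# Max-norm distance between the targets, and between the two fitted functions.
target_gap = max(abs(a - b) for a, b in zip(targets_a, targets_b))
fitted_gap = max(abs(fit(sample_xs, targets_a, q, K) - fit(sample_xs, targets_b, q, K))
                 for q in queries)

print("target gap:", target_gap)
print("fitted gap:", fitted_gap)
```

The fitted gap never exceeds the target gap, whatever targets are supplied. A general linear regression does not share this guarantee, because its implied weights can be negative or sum to more than one.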

Before this thesis, the state of the art in learning value functions for general
MDPs included algorithms that are guaranteed to converge when using
exact or piecewise constant representations, or when using a limited subset of
averagers. It also included algorithms that use general representations and can
work well in practice, but are not guaranteed to converge. It further included
algorithms that cannot handle fully general MDPs but that can guarantee
convergence with representations more general than averagers, such as TD(λ) for
Markov processes, or analytic solution of the HJB equations for some continuous
control problems. Finally, the state of the art in worst-case learning included
performance bounds for some generalized linear functions but not all.

Chapter 2 of this thesis advances the state of the art by defining an algorithm
with guaranteed convergence that can represent value functions with arbitrary
averagers. Chapter 3 advances the state of the art by extending worst-case
regret bounds to cover a larger fraction of generalized linear function
approximators. Finally, Chapter 4 takes the first steps towards an algorithm that can
use arbitrary linear function approximators to represent value functions.

During the course of this thesis, other researchers have (of course) also
advanced the state of the art in finding value functions. Of note are [TV94], which
duplicated some of the results in Chapter 2; [SJJ95], which described an online
algorithm related to fitted value iteration; [TV97], which extended TD(λ)
to handle stopping problems; [MM98], which described a kind of averager that
converges to the exact value function (in the limit of increasing
representational power) when approximating a continuous-time MDP; and [Bai95] and
[BM99], which developed gradient-descent style algorithms that are guaranteed
to converge at least to a local maximum when using (differentiable) general
representations.

Chapter 6


SUMMARY OF CONTRIBUTIONS

Finding approximate value functions for Markov decision processes is important
because it addresses a basic need in machine learning: the need for the
learner  to  find  reasonable  sequences  of  actions  despite  complicated,  probabilistic 
environments.  This  thesis  has  presented  three  threads  of  research  all  motivated 
by  the  goal  of  approximating  value  functions. 

The  contributions  of  the  research  on  fitted  value  iteration  are  to  discover  a 
class  of  function  approximators  that  is  compatible  with  fitted  value  iteration;  to 
derive convergence and error bounds for fitted value iteration using
approximators in this class; to reduce fitted value iteration to exact value iteration on an
embedded  process;  and  to  perform  experiments  demonstrating  that  fitted  value 
iteration  is  capable  of  solving  Markov  decision  processes  that  require  complex 
pattern  recognition. 
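
As an illustration of the kind of algorithm these contributions concern, the following is a minimal sketch of fitted value iteration with an averager. It assumes a small deterministic chain MDP with unit step costs, a discount factor of 0.9, values stored at three sample states, and linear interpolation as the function approximator. All of these choices and all names in the code are hypothetical and are not taken from the experiments in this thesis.

```python
GAMMA = 0.9           # discount factor (assumed)
N_STATES = 6          # states 0..5; state 5 is an absorbing zero-cost goal
SAMPLES = [0, 2, 5]   # states at which the value function is stored


def step(x, a):
    """Deterministic transition; a is -1 (left) or +1 (right)."""
    if x == N_STATES - 1:
        return x                      # the goal is absorbing
    return min(max(x + a, 0), N_STATES - 1)


def cost(x, a):
    """Unit cost per step everywhere except at the goal."""
    return 0.0 if x == N_STATES - 1 else 1.0


def interpolate(values, x):
    """Linear interpolation between stored values. This is an averager:
    the result is a convex combination of the stored values."""
    for i in range(len(SAMPLES) - 1):
        x0, x1 = SAMPLES[i], SAMPLES[i + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return (1 - t) * values[i] + t * values[i + 1]
    raise ValueError("state outside the sampled range")


def backup(values):
    """One step of fitted value iteration: a Bellman backup at each sample
    state, reading successor values through the averager."""
    return [min(cost(x, a) + GAMMA * interpolate(values, step(x, a))
                for a in (-1, 1))
            for x in SAMPLES]


def fitted_value_iteration(tol=1e-10, max_iters=10000):
    """Iterate the backup until the stored values stop changing."""
    values = [0.0] * len(SAMPLES)
    for iteration in range(1, max_iters + 1):
        new_values = backup(values)
        change = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if change < tol:
            return values, iteration
    raise RuntimeError("fitted value iteration did not converge")


values, iterations = fitted_value_iteration()
print("stored values:", values)
print("iterations:", iterations)
```

Because interpolation is a max-norm non-expansion and the backup is a contraction with factor GAMMA, the combined update is also a contraction, so the iteration converges from any starting point. The limit is the exact value function of a smaller process defined on the sample states, which in general differs from the true value function of the original chain.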

The  contributions  of  the  research  on  worst-case  learning  are  to  provide  a 
framework in which to prove regret bounds for a wide variety of learning
algorithms and to apply this framework to bring together known regret bounds
and  prove  new  ones.  While  we  have  not  proven  any  bounds  specifically  about 
the  problem  of  solving  Markov  decision  processes,  we  expect  that  the  results  of 
this  research  will  be  helpful  in  proving  such  bounds,  because  the  information 
available  to  a  learner  about  an  MDP  is  often  not  in  the  form  of  a  sample  of 
independent  identically  distributed  random  variables. 

The  contributions  of  the  research  on  solving  Markov  decision  processes  by 
convex programming are to explore the connection among MDPs, convex
optimization, and statistical estimation; to propose a new way to design algorithms
for  approximating  value  functions;  and  to  experiment  with  new  algorithms  built 
according  to  this  design.  While  the  new  algorithms  do  not  improve  on  the  best 
existing  methods  for  approximating  value  functions,  they  do  demonstrate  that 
the  design  holds  the  promise  of  avoiding  some  of  the  shortcomings  of  current 
value  function  approximation  methods. 

These  three  threads  of  research  work  together  to  advance  the  state  of  the  art 
in  finding  approximate  solutions  to  Markov  decision  processes.  Together  they 
provide  a  wide  variety  of  new  tools  for  designing  algorithms  that  allow  learners 
to  act  appropriately  in  complicated,  uncertain  environments. 